How to re-structure this in python pandas? Merge, unstack or what?

Question

Trying to re-structure a data frame with a format like this:

    key   ref name value
0    k1  None   N1     A
1  None    k1   N2     B
2  None    k1   N3     C
3    k2  None   N4     D
4    k3  None   N5     E
5  None    k3   N6     F
6  None    k3   N7     G

# In code
df = pd.DataFrame(columns=['key', 'ref', 'name', 'value'],
                  data=[
                    ['k1',None,'N1','A'],
                    [None,'k1','N2','B'],
                    [None,'k1','N3','C'],
                    ['k2',None,'N4','D'],
                    ['k3',None,'N5','E'],
                    [None,'k3','N6','F'],
                    [None,'k3','N7','G']])

Into this:

  key   ref name value name2 value2 name3 value3
0  k1    k1   N1     A    N2      B    N3      C
1  k2  None   N4     D  None   None  None   None
2  k3    k3   N5     E    N6      F    N7      G

But are struggeling to get it right. 'key' and 'ref' are not indexes above but feel free to elaborate of how to make use of them this way (source is an Excel-file of this format), if that is a part of the solution. The goal is to have names and values mapped accordingly to the example... (key's and ref's will then be discarded)

Tried with merge and stack but can't get it to work properly...

Note to following rules:

Keys in 'key' column are unique (except if emtpy/None)
Refs in 'ref' column are at most 2 of the same value

In other words:

Any 'key' has 0-2 corresponding 'ref'
Any 'ref' matches one, and only one, corresponding 'key'

divingTobi · Accepted Answer

Maybe this is what you need:

import pandas as pd

df = pd.DataFrame(
    columns=["key", "ref", "name", "value"],
    data=[
        ["k1", None, "N1", "A"],
        [None, "k1", "N2", "B"],
        [None, "k1", "N3", "C"],
        ["k2", None, "N4", "D"],
        ["k3", None, "N5", "E"],
        [None, "k3", "N6", "F"],
        [None, "k3", "N7", "G"],
    ],
)

print(df)
ind = df["key"].isna()
df1 = df.loc[~ind]
df2 = df.loc[ind]

combo = (
    df1.merge(df2[["ref", "name", "value"]], left_on="key", right_on="ref", how="left")
    .fillna("")
    .groupby("key")
    .agg(name=pd.NamedAgg("name_y", ":".join), value=pd.NamedAgg("value_y", ":".join))
)

for c in ["name", "value"]:
    dx = combo[c].str.split(":", expand=True).add_prefix(c)
    df1 = df1.merge(dx, left_on="key", right_index=True)

print(df1)

First, we split df into two separate dataframes, depending on whether there is an entry in Column key.

Second, we create a combo dataframe, merging the df1 with a aggregated df2, where the aggregation creates two columns for value and name, in which all entries are joined into a ":"-separated string.

Finally, we use str.split with expand=True to split those strings into separate columns and merge it with df1.

  key   ref name value name0 name1 value0 value1
0  k1  None   N1     A    N2    N3      B      C
3  k2  None   N4     D        None          None
4  k3  None   N5     E    N6    N7      F      G

The None entries are not yet quite right, but that would be easy to fix.

How to re-structure this in python pandas? Merge, unstack or what?

Answers (1)

Related Questions