Lemisourd
Lemisourd

Reputation: 145

Pandas Dataframe - running through two columns 'Father' and 'Son' to rebuild end-to-end links step by step

I have a long dataframe I need to transform to get a wide one. The long one is :

df = pd.DataFrame({
    'key' : ['E', 'E', 'E', 'E', 'J', 'J', 'J', 'J'],
    'father' : ['A', 'D', 'C', 'B', 'F', 'H', 'G', 'I'],
    'son' : ['B', 'E', 'D', 'C', 'G', 'I', 'H', 'J']
})
df

First thing to do, I think, is to group it by key. Then we have to find where those keys are found into the column 'son', it's the end (and last son) of the link I need to rebuild.

To rebuild the link, I need to look for his 'father'. His 'father' needs to be kept as father of final step and, also needs to be found into 'son'.

I need to iterate this until a 'father' cannot be found into the 'son' column, so it's going to be the father_0 of the link.

I think it could be done iterating those steps into a recursive function where the stop case : is 'father' not found in 'son'.

Here is the dataframe I want to get from this :

df1 = pd.DataFrame({
    'key' : ['E', 'J'],
    'father_1' : ['A', 'F'],
    'son_1' : ['B', 'G'],
    'father_2' : ['B', 'G'],
    'son_2' : ['C', 'H'],
    'father_3' : ['C', 'H'],
    'son_3' : ['D', 'I'],
    'father_4' : ['D', 'I'],
    'son_4' : ['E', 'J'],
})
df1

I simplified the problem here with 2 different links of the same depth, but they could be from depth 1 to depth 10 (maybe more but rarely and unpredictably) for a lot of different keys. Here is another example of df with 2 links of different size :

df_ = pd.DataFrame({
    'key' : ['E', 'E', 'E', 'E', 'K', 'K', 'K', 'K', 'K'],
    'father' : ['A', 'D', 'C', 'B', 'F', 'H', 'G', 'I', 'J'],
    'son' : ['B', 'E', 'D', 'C', 'G', 'I', 'H', 'J', 'K']
})
df_

df_1 = pd.DataFrame({
    'key' : ['E', 'K'],
    'father_1' : ['A', 'F'],
    'son_1' : ['B', 'G'],
    'father_2' : ['B', 'G'],
    'son_2' : ['C', 'H'],
    'father_3' : ['C', 'H'],
    'son_3' : ['D', 'I'],
    'father_4' : ['D', 'I'],
    'son_4' : ['E', 'J'],
    'father_5' : [np.NaN, 'J'],
    'son_5' : [np.NaN, 'K']
})
df_1 

Then the final step is easy, it's about taking 'father_x' and 'son_x-1' into 'step_x-1' : So the resulting dataframes for these examples would be :

df2 = pd.DataFrame({
    'key' : ['E', 'J'],
    'step_0' : ['A', 'F'],
    'step_1' : ['B', 'G'],
    'step_2' : ['C', 'H'],
    'step_3' : ['D', 'I'],
    'step_4' : ['E', 'J'],
})
df2

df_2 = pd.DataFrame({
    'key' : ['E', 'K'],
    'step_0' : ['A', 'F'],
    'step_1' : ['B', 'G'],
    'step_2' : ['C', 'H'],
    'step_3' : ['D', 'I'],
    'step_4' : ['E', 'J'],
    'step_5' : [np.NaN, 'K']
})
df_2

My concerne is more about the way to aggregate the data from wide to long following the previously given rules into an recursive function.

It's like in a groupby.agg but that I can't just pass a dictionnary into it because the new columns are based on the number of iteration of the recursive function on each key.

Upvotes: 0

Views: 98

Answers (2)

Lemisourd
Lemisourd

Reputation: 145

I found a solution for this specific type of dataframe : where we only have 1 predecessor for all values except root. It also requires using NetworkX. I didn't find a way to do it only using Pandas.

First, we need to build a graph from edgelist :

G = nx.from_pandas_edgelist(df, 'father', 'son', create_using=nx.MultiDiGraph, edge_key = 'key')
from networkx.drawing.nx_agraph import write_dot, graphviz_layout
#write_dot(G,'test.dot')
plt.title('draw_networkx')
pos =graphviz_layout(G, prog='dot')
nx.draw(G, pos, with_labels=True, arrows=True)

For pygraphviz install, please see this question. Then end-to-end links dataframe is built with :

num=0
num_max = len(df.key.drop_duplicates())
m_max = 30
dfy = pd.DataFrame(index=range(num_max),columns=range(m_max))
for n in df.key.drop_duplicates() :
    m = 0
    dfy.iloc[num, m] = n
    while len(list(G.predecessors(dfy.iloc[num,m])))!=0 :
        dfy.iloc[num,m+1] = list(G.predecessors(dfy.iloc[num,m]))[0]
        m+=1
    num+=1
print(dfy)

Output :

   0  1  2  3  4    5    6    7    8    9  ...   
0  E  D  C  B  A  NaN  NaN  NaN  NaN  NaN  ...  
1  K  J  I  H  G    F  NaN  NaN  NaN  NaN  ...

Upvotes: 0

BENY
BENY

Reputation: 323226

Assign the new key with cumcount then we can do pivot

out = df.assign(c = df.groupby('key').cumcount().add(1).astype(str)).pivot('key','c').sort_index(level=1,axis=1)
out.columns = out.columns.map('_'.join)
out
Out[34]: 
    father_1 son_1 father_2 son_2 father_3 son_3 father_4 son_4
key                                                            
E          A     B        B     C        C     D        D     E
J          F     G        G     H        H     I        I     J

Upvotes: 1

Related Questions