Reputation: 145
I have a long dataframe I need to transform to get a wide one. The long one is :
df = pd.DataFrame({
'key' : ['E', 'E', 'E', 'E', 'J', 'J', 'J', 'J'],
'father' : ['A', 'D', 'C', 'B', 'F', 'H', 'G', 'I'],
'son' : ['B', 'E', 'D', 'C', 'G', 'I', 'H', 'J']
})
df
First thing to do, I think, is to group it by key. Then we have to find where those keys are found into the column 'son', it's the end (and last son) of the link I need to rebuild.
To rebuild the link, I need to look for his 'father'. His 'father' needs to be kept as father of final step and, also needs to be found into 'son'.
I need to iterate this until a 'father' cannot be found into the 'son' column, so it's going to be the father_0 of the link.
I think it could be done iterating those steps into a recursive function where the stop case : is 'father' not found in 'son'.
Here is the dataframe I want to get from this :
df1 = pd.DataFrame({
'key' : ['E', 'J'],
'father_1' : ['A', 'F'],
'son_1' : ['B', 'G'],
'father_2' : ['B', 'G'],
'son_2' : ['C', 'H'],
'father_3' : ['C', 'H'],
'son_3' : ['D', 'I'],
'father_4' : ['D', 'I'],
'son_4' : ['E', 'J'],
})
df1
I simplified the problem here with 2 different links of the same depth, but they could be from depth 1 to depth 10 (maybe more but rarely and unpredictably) for a lot of different keys. Here is another example of df with 2 links of different size :
df_ = pd.DataFrame({
'key' : ['E', 'E', 'E', 'E', 'K', 'K', 'K', 'K', 'K'],
'father' : ['A', 'D', 'C', 'B', 'F', 'H', 'G', 'I', 'J'],
'son' : ['B', 'E', 'D', 'C', 'G', 'I', 'H', 'J', 'K']
})
df_
df_1 = pd.DataFrame({
'key' : ['E', 'K'],
'father_1' : ['A', 'F'],
'son_1' : ['B', 'G'],
'father_2' : ['B', 'G'],
'son_2' : ['C', 'H'],
'father_3' : ['C', 'H'],
'son_3' : ['D', 'I'],
'father_4' : ['D', 'I'],
'son_4' : ['E', 'J'],
'father_5' : [np.NaN, 'J'],
'son_5' : [np.NaN, 'K']
})
df_1
Then the final step is easy, it's about taking 'father_x' and 'son_x-1' into 'step_x-1' : So the resulting dataframes for these examples would be :
df2 = pd.DataFrame({
'key' : ['E', 'J'],
'step_0' : ['A', 'F'],
'step_1' : ['B', 'G'],
'step_2' : ['C', 'H'],
'step_3' : ['D', 'I'],
'step_4' : ['E', 'J'],
})
df2
df_2 = pd.DataFrame({
'key' : ['E', 'K'],
'step_0' : ['A', 'F'],
'step_1' : ['B', 'G'],
'step_2' : ['C', 'H'],
'step_3' : ['D', 'I'],
'step_4' : ['E', 'J'],
'step_5' : [np.NaN, 'K']
})
df_2
My concerne is more about the way to aggregate the data from wide to long following the previously given rules into an recursive function.
It's like in a groupby.agg
but that I can't just pass a dictionnary into it because the new columns are based on the number of iteration of the recursive function on each key.
Upvotes: 0
Views: 98
Reputation: 145
I found a solution for this specific type of dataframe : where we only have 1 predecessor for all values except root. It also requires using NetworkX. I didn't find a way to do it only using Pandas.
First, we need to build a graph from edgelist :
G = nx.from_pandas_edgelist(df, 'father', 'son', create_using=nx.MultiDiGraph, edge_key = 'key')
from networkx.drawing.nx_agraph import write_dot, graphviz_layout
#write_dot(G,'test.dot')
plt.title('draw_networkx')
pos =graphviz_layout(G, prog='dot')
nx.draw(G, pos, with_labels=True, arrows=True)
For pygraphviz install, please see this question. Then end-to-end links dataframe is built with :
num=0
num_max = len(df.key.drop_duplicates())
m_max = 30
dfy = pd.DataFrame(index=range(num_max),columns=range(m_max))
for n in df.key.drop_duplicates() :
m = 0
dfy.iloc[num, m] = n
while len(list(G.predecessors(dfy.iloc[num,m])))!=0 :
dfy.iloc[num,m+1] = list(G.predecessors(dfy.iloc[num,m]))[0]
m+=1
num+=1
print(dfy)
Output :
0 1 2 3 4 5 6 7 8 9 ...
0 E D C B A NaN NaN NaN NaN NaN ...
1 K J I H G F NaN NaN NaN NaN ...
Upvotes: 0
Reputation: 323226
Assign the new key with cumcount
then we can do pivot
out = df.assign(c = df.groupby('key').cumcount().add(1).astype(str)).pivot('key','c').sort_index(level=1,axis=1)
out.columns = out.columns.map('_'.join)
out
Out[34]:
father_1 son_1 father_2 son_2 father_3 son_3 father_4 son_4
key
E A B B C C D D E
J F G G H H I I J
Upvotes: 1