Reputation: 145
I'm having trouble transforming one dataframe into another.
I don't know which functions to use. I had some ideas, but none of them worked.
For example, I don't understand how to use append here, or whether I should be using it at all.
Here is my original dataframe:
df1 = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2'],
    'X0': ['a','b','a'],
    'Y0': ['c','d','c'],
    'X1': ['e','f','f'],
    'Y1': ['g','h','h']
})
  Key X0 Y0 X1 Y1
0  K0  a  c  e  g
1  K1  b  d  f  h
2  K2  a  c  f  h
This dataframe lists all the links attached to the ID in column Key. The links follow each other: X0->Y0 is the parent of X1->Y1.
It's easy to read. The real dataframe I'm working with has 6500 rows and 21 columns and represents a tree of links, so it follows an end-to-end link logic.
I want to transform it into a dataframe that follows a unitary link and ID logic (because it's a tree of links, a unitary link may be part of several end-to-end links).
So I want to put each individual link into X->Y, and I need the list of the keys attached to each unitary link in the Key column.
Here is what I want:
df3 = pd.DataFrame({
    'Key':[['K0','K2'],'K1','K0',['K1','K2']],
    'X':['a','b','e','f'],
    'Y':['c','d','g','h']
})
        Key  X  Y
0  [K0, K2]  a  c
1        K1  b  d
2        K0  e  g
3  [K1, K2]  f  h
To do this, I first have to combine X0 and X1 into a single X column, and likewise Y0 and Y1 into a single Y column, while keeping the keys attached to the links. This first transformation gives a new dataframe that contains all the original information, including duplicates, which I will deal with afterwards to obtain df3.
Here is the transition dataframe:
df2 = pd.DataFrame({
    'Key':['K0','K1','K2','K0','K1','K2'],
    'X':['a','b','a','e','f','f'],
    'Y':['c','d','c','g','h','h']
})
  Key  X  Y
0  K0  a  c
1  K1  b  d
2  K2  a  c
3  K0  e  g
4  K1  f  h
5  K2  f  h
Transition from df1 to df2
So far, I have done this to put X0, X1 and Y0, Y1 into X and Y:
Key = pd.Series(dtype=str)
X = pd.Series(dtype=str)
Y = pd.Series(dtype=str)
for i in df1.columns:
    if 'K' in i:
        Key = Key.append(df1[i], ignore_index=True)
    elif 'X' in i:
        X = X.append(df1[i], ignore_index=True)
    elif 'Y' in i:
        Y = Y.append(df1[i], ignore_index=True)
0    K0
1    K1
2    K2
dtype: object
0    a
1    b
2    a
3    e
4    f
5    f
dtype: object
0    c
1    d
2    c
3    g
4    h
5    h
dtype: object
But I don't know how to keep each key aligned with the right links.
Also, I build df2 like this, but it feels awkward and I don't understand how I should do it:
df2 = pd.DataFrame({
    'Key':Key,
    'X':X,
    'Y':Y
})
   Key  X  Y
0   K0  a  c
1   K1  b  d
2   K2  a  c
3  NaN  e  g
4  NaN  f  h
5  NaN  f  h
I tried to use append to combine the X0, X1 and Y0, Y1 columns directly into df2, but the result was a complete mess: the df2 columns were not filled with the content of the df1 columns. I also tried to use append to put the Series Key, X and Y directly into df2, but that gave me X and Y as rows instead of columns.
In short, I'm quite lost. I know it may take a fair amount of code to take df1, turn it into df2 and then into df3. I'm not asking you to solve the problem for me, but I really need help with the functions I should use or the logic I should put in place to achieve my goal.
Upvotes: 0
Views: 106
Reputation: 5503
To convert df1 to df2 you want to look into pandas.wide_to_long.
https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
>>> df2 = pd.wide_to_long(df1, stubnames=['X','Y'], i='Key', j='num')
>>> df2
         X  Y
Key num
K0  0    a  c
K1  0    b  d
K2  0    a  c
K0  1    e  g
K1  1    f  h
K2  1    f  h
You can drop the unwanted level "num" from the index using droplevel and turn the index level "Key" into a column using reset_index. Chaining everything:
>>> df2 = (
...     pd.wide_to_long(df1, stubnames=['X','Y'], i='Key', j='num')
...     .droplevel(level='num')
...     .reset_index()
... )
>>> df2
  Key  X  Y
0  K0  a  c
1  K1  b  d
2  K2  a  c
3  K0  e  g
4  K1  f  h
5  K2  f  h
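For intuition, the same reshaping can be sketched by hand with pd.concat, which is roughly what the append loop in the question was aiming for. This is only an illustrative sketch: the names suffixes, parts and df2_manual are made up here, and it assumes every X column has a matching Y column with the same numeric suffix. pd.concat is used because Series.append and DataFrame.append were deprecated in pandas 1.4 and removed in pandas 2.0.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2'],
    'X0': ['a', 'b', 'a'],
    'Y0': ['c', 'd', 'c'],
    'X1': ['e', 'f', 'f'],
    'Y1': ['g', 'h', 'h'],
})

# Numeric suffixes found on the X columns (assumed to match the Y columns).
suffixes = sorted(int(c[1:]) for c in df1.columns if c.startswith('X'))

# One (Key, X, Y) frame per suffix, so the keys travel with their links.
parts = [
    df1[['Key', f'X{n}', f'Y{n}']].rename(columns={f'X{n}': 'X', f'Y{n}': 'Y'})
    for n in suffixes
]

# Stack the per-suffix frames on top of each other.
df2_manual = pd.concat(parts, ignore_index=True)
print(df2_manual)
```

Selecting Key together with each X/Y pair before stacking is what avoids the NaN keys seen in the question, where Key was only collected once.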
Finally, to get df3 you just need to group df2 by "X" and "Y", and aggregate the "Key" groups into lists.
>>> df3 = df2.groupby(['X','Y'], as_index=False).agg(list)
>>> df3
   X  Y       Key
0  a  c  [K0, K2]
1  b  d      [K1]
2  e  g      [K0]
3  f  h  [K1, K2]
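As a quick sanity check, a list-valued "Key" column can be unpacked again with DataFrame.explode, which should give back the rows of df2 (possibly in a different order). The sketch below rebuilds df2 from the question so that it runs on its own; the name back is just illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2', 'K0', 'K1', 'K2'],
    'X': ['a', 'b', 'a', 'e', 'f', 'f'],
    'Y': ['c', 'd', 'c', 'g', 'h', 'h'],
})

# Aggregate the keys of each (X, Y) link into a list, as above.
df3 = df2.groupby(['X', 'Y'], as_index=False).agg(list)

# explode turns each list element back into its own row.
back = df3.explode('Key', ignore_index=True)
print(back)
```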
If you don't want single keys to be lists, you can do this instead:
>>> df3 = (
...     df2.groupby(['X','Y'], as_index=False)
...     .agg(lambda g: g.iloc[0] if len(g) == 1 else list(g))
... )
>>> df3
   X  Y       Key
0  a  c  [K0, K2]
1  b  d        K1
2  e  g        K0
3  f  h  [K1, K2]
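For the real 21-column dataframe, the two steps could be wrapped into one helper along these lines. This is a sketch under assumptions the question does not confirm: that the columns are named X0, Y0, X1, Y1, ... like in the example, that the values in "Key" are unique, and that chains shorter than the maximum depth are padded with missing values (None/NaN). The function name links_to_keys and the sample data are hypothetical.

```python
import pandas as pd

def links_to_keys(df, key='Key'):
    """Return one row per unitary X->Y link with the list of keys using it."""
    long = (
        pd.wide_to_long(df, stubnames=['X', 'Y'], i=key, j='num')
        .dropna(subset=['X', 'Y'])   # assumed padding for shorter chains
        .reset_index()
    )
    return long.groupby(['X', 'Y'])[key].agg(list).reset_index()

# Hypothetical sample: K2 has only one link, so X1/Y1 are missing.
sample = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2'],
    'X0': ['a', 'b', 'a'],
    'Y0': ['c', 'd', 'c'],
    'X1': ['e', 'f', None],
    'Y1': ['g', 'h', None],
})

result = links_to_keys(sample)
print(result)
```

Keeping every entry of "Key" as a list, even for a single key, tends to make later processing simpler, since each cell then has the same type.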
Upvotes: 1