Lemisourd

Reputation: 145

Pandas DataFrame transformation: which functions and logic should I use?

I'm struggling to transform one dataframe into another.

I don't know which functions I should use to do what I want. The ideas I tried didn't work at all.

For example, I don't understand how I should use append (or whether I should use it at all, rather than something else) to do what I want.

Here is my original dataframe:

df1 = pd.DataFrame({
            'Key': ['K0', 'K1', 'K2'],
            'X0': ['a','b','a'],
            'Y0': ['c','d','c'],
            'X1': ['e','f','f'],
            'Y1': ['g','h','h']
})

    Key X0  Y0  X1  Y1
0   K0  a   c   e   g
1   K1  b   d   f   h
2   K2  a   c   f   h

This dataframe represents every link attached to the ID in column Key. Links follow one another: X0->Y0 is the parent of X1->Y1.

It's easy to read, and the real dataframe I'm working with has 6500 rows and 21 columns that represent a tree of links. So this dataframe follows an end-to-end link logic.

I want to transform it into another one organized by unitary link and ID (because it's a tree of links, a unitary link may be part of several end-to-end links).

So I want to put each individual link into X->Y, and I need the list of the keys attached to each unitary link in the Key column.

Here is what I want:

df3 = pd.DataFrame({
    'Key':[['K0','K2'],'K1','K0',['K1','K2']],
    'X':['a','b','e','f'],
    'Y':['c','d','g','h']
})

    Key         X   Y
0   [K0, K2]    a   c
1   K1          b   d
2   K0          e   g
3   [K1, K2]    f   h

To do this, I first have to combine X0 and X1 into a single X column, and likewise Y0 and Y1 into a single Y column. At the same time, I need to keep the keys attached to the links. This first transformation produces a new dataframe containing all the original information, with duplicates that I will deal with afterwards to obtain df3.

Here is the intermediate dataframe:

df2 = pd.DataFrame({
    'Key':['K0','K1','K2','K0','K1','K2'],
    'X':['a','b','a','e','f','f'],
    'Y':['c','d','c','g','h','h']
})

    Key X   Y
0   K0  a   c
1   K1  b   d
2   K2  a   c
3   K0  e   g
4   K1  f   h
5   K2  f   h

Transition from df1 to df2

So far, I did this to put X0, X1 and Y0, Y1 into X and Y:

Key = pd.Series(dtype=str)
X = pd.Series(dtype=str)
Y = pd.Series(dtype=str)
for i in df1.columns:
    if 'K' in i:
        Key = Key.append(df1[i], ignore_index=True)
    elif 'X' in i:
        X = X.append(df1[i], ignore_index=True)
    elif 'Y' in i:
        Y = Y.append(df1[i], ignore_index=True)

0    K0
1    K1
2    K2
dtype: object
0    a
1    b
2    a
3    e
4    f
5    f
dtype: object
0    c
1    d
2    c
3    g
4    h
5    h
dtype: object

But I do not know how to carry the keys along so that each one stays next to the right link.

Also, I do this to construct df2, but it feels wrong and I do not understand how I should do it:

df2 = pd.DataFrame({
    'Key':Key,
    'X':X,
    'Y':Y
})

    Key X   Y
0   K0  a   c
1   K1  b   d
2   K2  a   c
3   NaN e   g
4   NaN f   h
5   NaN f   h

I tried to use append to combine the X0, X1 and Y0, Y1 columns directly into df2, but it turned into a complete mess: the df2 columns were not filled with the content of the df1 columns. I also tried to use append to put the Series Key, X and Y directly into df2, but that gave me X and Y as rows instead of columns.

In short, I'm quite lost. I know there may be a lot to program to take df1, turn it into df2 and then into df3. I'm not asking you to solve the problem for me, but I really need help with the functions I should use or the logic I should put in place to achieve my goal.

Upvotes: 0

Views: 106

Answers (1)

Rodalm

Reputation: 5503

To convert df1 to df2 you want to look into pandas.wide_to_long.

https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html

>>> df2 = pd.wide_to_long(df1, stubnames=['X','Y'], i='Key', j='num')
>>> df2

         X  Y
Key num      
K0  0    a  c
K1  0    b  d
K2  0    a  c
K0  1    e  g
K1  1    f  h
K2  1    f  h

You can drop the unwanted level "num" from the index using droplevel and turn the index level "Key" into a column using reset_index. Chaining everything:

>>> df2 = (
       pd.wide_to_long(df1, stubnames=['X','Y'], i='Key', j='num')
         .droplevel(level='num')
         .reset_index()
)  
>>> df2

  Key  X  Y
0  K0  a  c
1  K1  b  d
2  K2  a  c
3  K0  e  g
4  K1  f  h
5  K2  f  h
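
For comparison, here is a rough manual version of the same reshaping, in case it helps to see the logic spelled out. It is only a sketch: it assumes the columns are named exactly X0, Y0, X1, Y1, ... with no gaps, and the names n_levels, pieces and df2_manual are made up for the illustration. It uses pd.concat rather than append, since the append method was removed from Series and DataFrame in pandas 2.0.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2'],
    'X0': ['a', 'b', 'a'],
    'Y0': ['c', 'd', 'c'],
    'X1': ['e', 'f', 'f'],
    'Y1': ['g', 'h', 'h'],
})

# Assumption: every level n has both an Xn and a Yn column, numbered from 0.
n_levels = sum(col.startswith('X') for col in df1.columns)

# Take one (Key, Xn, Yn) slice per level and give all slices the same column names.
pieces = [
    df1[['Key', f'X{n}', f'Y{n}']].rename(columns={f'X{n}': 'X', f'Y{n}': 'Y'})
    for n in range(n_levels)
]

# Stack the slices vertically. Because Key is part of every slice,
# each link keeps its key and no NaN appears.
df2_manual = pd.concat(pieces, ignore_index=True)
print(df2_manual)
```

The point of slicing whole rows (Key included) instead of building three separate Series is that the key never gets separated from its link.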

Finally, to get df3 you just need to group df2 by "X" and "Y", and aggregate the "Key" groups into lists.

>>> df3 = df2.groupby(['X','Y'], as_index=False).agg(list)
>>> df3

   X  Y       Key
0  a  c  [K0, K2]
1  b  d      [K1]
2  e  g      [K0]
3  f  h  [K1, K2]

If you don't want single keys to be wrapped in lists, you can do this instead:

>>> df3 = (
        df2.groupby(['X','Y'], as_index=False)
           .agg(lambda g: g.iloc[0] if len(g) == 1 else list(g))
)
>>> df3

   X  Y       Key
0  a  c  [K0, K2]
1  b  d        K1
2  e  g        K0
3  f  h  [K1, K2]
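
If you want to check that no link or key was lost along the way, one possible sanity check is to expand the lists again with DataFrame.explode and compare the result with df2. This is a sketch, not part of the solution itself: df2 is rebuilt by hand here so that the snippet runs on its own, and the names round_trip, original, restored and same_links are made up for the illustration.

```python
import pandas as pd

df2 = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2', 'K0', 'K1', 'K2'],
    'X': ['a', 'b', 'a', 'e', 'f', 'f'],
    'Y': ['c', 'd', 'c', 'g', 'h', 'h'],
})

# Same aggregation as above: one row per (X, Y) link, keys gathered in a list.
df3 = df2.groupby(['X', 'Y'], as_index=False).agg(list)

# explode gives each list element its own row again (scalar entries are left as they are).
round_trip = df3.explode('Key')

# Compare the (Key, X, Y) triples, ignoring row order.
cols = ['Key', 'X', 'Y']
original = sorted(df2[cols].itertuples(index=False, name=None))
restored = sorted(round_trip[cols].itertuples(index=False, name=None))
same_links = original == restored
print(same_links)
```

Comparing sorted tuples rather than the dataframes themselves sidesteps differences in row order and index between df2 and the exploded frame.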

Upvotes: 1
