TheChymera
TheChymera

Reputation: 17934

Convert Pandas df with missing entries from long to wide format

I have a Pandas dataframe looking roughly like this (with dt* being datetime objects):

     A    B    C    D 
1  dt1    X    1    1
2 None    Y    4    1
3  dt2    X    2    2
4 None    Y    8    2
5 None    X    3    3
6 None    Y    2    3
7  dt3    X    7    4
8  dt3    Y    1    4

And I would like to convert it into such a shape:

     A    X    y    D
1  dt1    1    4    1
2  dt2    2    8    2   
3 None    3    2    3
4  dt3    7    1    4

It is a given that the datetime objects are either identical, or one is None.

I have tried using the following:

pd.pivot_table(table, index=["D"], columns=["B"], values=["C","A"], aggfunc=lambda x: ''.join(set(x)))

But this just makes the "A" column disappear.

Other variations on the above have also failed:

table = table.pivot(index="D", columns='B')

This one, for instance, creates a hierarchical column dataframe which I have been unable to collapse into the desired structure.

Can you help me out?

Upvotes: 1

Views: 361

Answers (2)

Nickil Maveli
Nickil Maveli

Reputation: 29719

After computing the pivot_table, join the resulting DF with the original DF that gets grouped by D to find the first occurence of A.

piv = df.pivot_table(index=['D'], columns=['B'])['C']
piv.join(df.groupby('D')['A'].first()).reset_index().rename_axis([None], axis=1)

(OR)

df.set_index(['D', 'B'])['C'].unstack(1).join(df.groupby('D')['A'].first()).reset_index()

Both yield:

enter image description here

Upvotes: 1

jezrael
jezrael

Reputation: 863281

You can use:

#sort values by column A
df = df.sort_values('A', ascending=False)
#replace string 'None' to None if necessary
df.A = df.A.replace({'None': None})
#groupby by column D and forward fill first value in each group
df.A = df.groupby('D', sort=False)['A'].ffill()
#reshape by set_index with unstack
df = df.set_index(['A','D', 'B'])['C'].unstack().reset_index()
#reorder columns, remove column name
df = df[['A','X','Y','D']].rename_axis(None, axis=1)
#if necessary replace NaN to None
df.A = df.A.replace({np.nan: None})
#sort by column D
df = df.sort_values('D')
print (df)
      A  X  Y  D
1   dt1  1  4  1
2   dt2  2  8  2
0  None  3  2  3
3   dt3  7  1  4

Upvotes: 1

Related Questions