babz
babz

Reputation: 479

Groupby and transpose in pandas, python

Dataframe have

ID  col  col2   col3   col4

1   A    50      S      1
1   A    52      M      4
1   B    45      N      8
1   C    18      S      7

Dataframe want

ID  col  colA   colB   colC   colD   colE   colF

1   A    50     52      S      M       1      4
1   B    45     NULL    N     NULL     8     NULL
1   C    18     NULL    S     NULL     7     NULL

I want 1 line per unique ID+col (groupby ID and col). If there are multiple entries per ID+col (max can be 2, no more) then put the first value of col2 in colA and second value in colB, put the first value of col3 in colC and second value in colD, put the first value of col4 in colE and second value in colF. If there is only one entry per ID+col then for col2 put the value in colA and colB is null etc.

I tried to first create a counter:

df['COUNT'] = df.groupby(['ID','col']).cumcount()+1

From here I was thinking of just adding a column to say

if count=1 then df['colA']=df.col2
if count=2 then df['colB']=df.col2

.. but this will still result in the same number of rows as the original df.

Upvotes: 4

Views: 4735

Answers (3)

Rohit Khanna
Rohit Khanna

Reputation: 57

Not sure if this is what you looking for, but it renders the same result you are looking for. Please note I am using multiple aggregate function on same column and thus using ravel function to flatten the dataframe columns.

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID':[1,1,1,1], 
                  'Col1':['A','A','B','C'],
                 'Col2':[50,52,45,18]})

df = df.groupby(['ID','Col1']).agg({'Col2':['first','last']})
df.columns = ["_".join(x) for x in df.columns.ravel()]
df = df.reset_index()
df['Col2_last'] = np.where(df.Col2_first == df.Col2_last, float('nan'), df.Col2_last)

print(df)

Upvotes: 1

jezrael
jezrael

Reputation: 862661

I think need set_index with unstack:

df['COUNT'] = df.groupby(['ID','col']).cumcount()+1

df = df.set_index(['ID','col', 'COUNT'])['col2'].unstack().add_prefix('col').reset_index()
print (df)
COUNT  ID col  col1  col2
0       1   A  50.0  52.0
1       1   B  45.0   NaN
2       1   C  18.0   NaN

Or:

c = df.groupby(['ID','col']).cumcount()+1

df = df.set_index(['ID','col', c])['col2'].unstack().add_prefix('col').reset_index()
print (df)
   ID col  col1  col2
0   1   A  50.0  52.0
1   1   B  45.0   NaN
2   1   C  18.0   NaN

EDIT:

For multiple columns is solution a bit changed, because working with MultiIndex in columns:

df['COUNT'] = (df.groupby(['ID','col']).cumcount()+1).astype(str)

#remove col2
df = df.set_index(['ID','col', 'COUNT']).unstack()
#flatten Multiindex
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
   ID col  col2_1  col2_2 col3_1 col3_2  col4_1  col4_2
0   1   A    50.0    52.0      S      M     1.0     4.0
1   1   B    45.0     NaN      N   None     8.0     NaN
2   1   C    18.0     NaN      S   None     7.0     NaN

Upvotes: 3

BENY
BENY

Reputation: 323236

You can using groupby with apply(pd.Series)

df.groupby(['ID','col']).col2.apply(list).apply(pd.Series).add_prefix('col').reset_index()
Out[404]: 
   ID col  col0  col1
0   1   A  50.0  52.0
1   1   B  45.0   NaN
2   1   C  18.0   NaN

Upvotes: 2

Related Questions