Reputation: 479
DataFrame I have:
ID  col  col2  col3  col4
1   A    50    S     1
1   A    52    M     4
1   B    45    N     8
1   C    18    S     7
DataFrame I want:
ID  col  colA  colB  colC  colD  colE  colF
1   A    50    52    S     M     1     4
1   B    45    NULL  N     NULL  8     NULL
1   C    18    NULL  S     NULL  7     NULL
I want one row per unique ID+col (groupby ID and col). If there are multiple entries per ID+col (at most 2, never more), then put the first value of col2 in colA and the second in colB, the first value of col3 in colC and the second in colD, and the first value of col4 in colE and the second in colF. If there is only one entry per ID+col, then put the single col2 value in colA and leave colB null, and likewise for the other columns.
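For reference, a small frame matching the "have" table above (column names as shown) can be built like this:
import pandas as pd

df = pd.DataFrame({'ID':   [1, 1, 1, 1],
                   'col':  ['A', 'A', 'B', 'C'],
                   'col2': [50, 52, 45, 18],
                   'col3': ['S', 'M', 'N', 'S'],
                   'col4': [1, 4, 8, 7]})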
I tried to first create a counter:
df['COUNT'] = df.groupby(['ID','col']).cumcount()+1
From here I was thinking of just adding columns conditionally:
if COUNT == 1 then df['colA'] = df.col2
if COUNT == 2 then df['colB'] = df.col2
... but this still leaves the same number of rows as the original df.
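A sketch of that attempt (the colA/colB assignments are hypothetical), just to show that the row count never shrinks:
df['COUNT'] = df.groupby(['ID', 'col']).cumcount() + 1
# conditional assignment fills colA/colB per row...
df.loc[df['COUNT'] == 1, 'colA'] = df['col2']
df.loc[df['COUNT'] == 2, 'colB'] = df['col2']
# ...but the frame still has one row per original row (4 here, not 3)
print(len(df))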
Upvotes: 4
Views: 4735
Reputation: 57
Not sure if this is exactly what you are looking for, but it produces the same result. Please note that I am using multiple aggregate functions on the same column and therefore use ravel to flatten the DataFrame columns.
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID':   [1, 1, 1, 1],
                   'Col1': ['A', 'A', 'B', 'C'],
                   'Col2': [50, 52, 45, 18]})

# two aggregate functions on the same column -> MultiIndex columns
df = df.groupby(['ID', 'Col1']).agg({'Col2': ['first', 'last']})
# flatten the MultiIndex, e.g. ('Col2', 'first') -> 'Col2_first'
df.columns = ["_".join(x) for x in df.columns.ravel()]
df = df.reset_index()
# if first == last the group had only one row, so blank out the "last" value
df['Col2_last'] = np.where(df.Col2_first == df.Col2_last, float('nan'), df.Col2_last)
print(df)
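Presumably the same pattern extends to the question's other value columns by listing them all in the agg dict; a sketch using the question's column names (ID, col, col2, col3, col4):
import numpy as np

# assumes `df` is the question's frame with columns ID, col, col2, col3, col4
agg = df.groupby(['ID', 'col']).agg({'col2': ['first', 'last'],
                                     'col3': ['first', 'last'],
                                     'col4': ['first', 'last']})
agg.columns = ['_'.join(x) for x in agg.columns]
agg = agg.reset_index()
# blank out the "last" value wherever the group had a single row
for c in ['col2', 'col3', 'col4']:
    agg[c + '_last'] = np.where(agg[c + '_first'] == agg[c + '_last'],
                                np.nan, agg[c + '_last'])
print(agg)
Like the snippet above, this blanks the second value whenever it equals the first, so a group whose two entries happen to be identical would also show NaN.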
Upvotes: 1
Reputation: 862661
I think you need set_index with unstack:
df['COUNT'] = df.groupby(['ID','col']).cumcount()+1
df = df.set_index(['ID','col', 'COUNT'])['col2'].unstack().add_prefix('col').reset_index()
print (df)
COUNT  ID col  col1  col2
0       1   A  50.0  52.0
1       1   B  45.0   NaN
2       1   C  18.0   NaN
Or:
c = df.groupby(['ID','col']).cumcount()+1
df = df.set_index(['ID','col', c])['col2'].unstack().add_prefix('col').reset_index()
print (df)
   ID col  col1  col2
0   1   A  50.0  52.0
1   1   B  45.0   NaN
2   1   C  18.0   NaN
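If you need the exact colA/colB names from the question, you can simply rename afterwards (this variant only covers col2; for all value columns see the EDIT below):
df = df.rename(columns={'col1': 'colA', 'col2': 'colB'})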
EDIT:
For multiple columns the solution changes a bit, because we are working with a MultiIndex in the columns:
df['COUNT'] = (df.groupby(['ID','col']).cumcount()+1).astype(str)
# no ['col2'] selection here, so all value columns are unstacked
df = df.set_index(['ID','col', 'COUNT']).unstack()
# flatten the MultiIndex in columns, e.g. ('col2', '1') -> 'col2_1'
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
  ID col  col2_1  col2_2 col3_1 col3_2  col4_1  col4_2
0  1   A    50.0    52.0      S      M     1.0     4.0
1  1   B    45.0     NaN      N   None     8.0     NaN
2  1   C    18.0     NaN      S   None     7.0     NaN
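And if you want exactly the colA..colF layout from the question, map the flattened names afterwards (optional sketch):
df = df.rename(columns={'col2_1': 'colA', 'col2_2': 'colB',
                        'col3_1': 'colC', 'col3_2': 'colD',
                        'col4_1': 'colE', 'col4_2': 'colF'})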
Upvotes: 3
Reputation: 323236
You can use groupby with apply(pd.Series):
df.groupby(['ID','col']).col2.apply(list).apply(pd.Series).add_prefix('col').reset_index()
Out[404]:
   ID col  col0  col1
0   1   A  50.0  52.0
1   1   B  45.0   NaN
2   1   C  18.0   NaN
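The same list-expansion idea should extend to all three value columns by looping and concatenating the pieces; a sketch using the question's target names:
import pandas as pd

# assumes `df` has columns ID, col, col2, col3, col4 as in the question
pairs = {'col2': ['colA', 'colB'],
         'col3': ['colC', 'colD'],
         'col4': ['colE', 'colF']}
parts = []
for src, names in pairs.items():
    part = df.groupby(['ID', 'col'])[src].apply(list).apply(pd.Series)
    part.columns = names[:len(part.columns)]
    parts.append(part)
out = pd.concat(parts, axis=1).reset_index()
print(out)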
Upvotes: 2