Reputation: 3379
Say I have the following dataframe:
import pandas as pd
df = pd.DataFrame()
df['ID'] = [1, 1, 1, 2, 2]
df['Type'] = ['A', 'B', 'Q', 'B', 'R']
df['Status'] = [0, 0, 1, 0, 1]
>>> df
ID Type Status
0 1 A 0
1 1 B 0
2 1 Q 1
3 2 B 0
4 2 R 1
>>>
I want to group this dataframe by 'ID' and reshape it so that I have a "Type" variable and a "Status" variable for each item within the group. See below:
Type1 Type2 Type3 Status1 Status2 Status3
ID
1 A B Q 0 0 1
2 B R NaN 0 1 NaN
The number of rows in my output dataframe will depend on the max number of records in any one group of IDs.
I believe that the pivot function is what I want to use here. However, it calls for a "columns" parameter which I believe should be an ID of each item within each group. I have a very clunky way of calculating this, which I appreciate advice on a better way of doing this:
>>> g=df.groupby('ID')
>>> df['IDinGroup']=[item for sublist in [range(1,len(i[1])+1) for i in g] for item in sublist]
>>> df
ID Type Status IDinGroup
0 1 A 0 1
1 1 B 0 2
2 1 Q 1 3
3 2 B 0 1
4 2 R 1 2
>>>
Then I can loop through the 'Type' and 'Status' variables and pivot each and then merge them back together:
>>> ListOfValues=[]
>>> for ValueCol in ['Type','Status']:
... f=df.pivot(index='ID',columns='IDinGroup',values=ValueCol)
... f.columns=[ValueCol+str(Col) for Col in f.columns]
... f.columns.name=None
... ListOfValues.append(f)
...
>>> pd.concat(ListOfValues,1)
Type1 Type2 Type3 Status1 Status2 Status3
ID
1 A B Q 0 0 1
2 B R NaN 0 1 NaN
>>>
Am I taking the correct approach here? And if so, what is a better way to calculate the "columns" parameter for the pivot function? (the id of each item within each group)
Upvotes: 0
Views: 51
Reputation: 97301
Try this:
import pandas as pd
df = pd.DataFrame()
df['ID'] = [1, 1, 1, 2, 2]
df['Type'] = ['A', 'B', 'Q', 'B', 'R']
df['Status'] = [0, 0, 1, 0, 1]
g = df.groupby("ID")[["Type","Status"]]
df2 = g.apply(pd.DataFrame.reset_index, drop=True).unstack()
Here is the result:
Type Status
0 1 2 0 1 2
ID
1 A B Q 0 0 1
2 B R NaN 0 1 NaN
The columns is a MultiIndex, if you want to flat it:
df2.columns = [l0 + str(l1+1) for l0, l1 in df2.columns]
the output:
Type1 Type2 Type3 Status1 Status2 Status3
ID
1 A B Q 0 0 1
2 B R NaN 0 1 NaN
Upvotes: 1