Reputation: 1851
I have a pandas dataframe as below. For each 'Id' I can have multiple 'Names' and 'Sub-ids'.
Id NAME SUB_ID
276956 A 5933
276956 B 5934
276956 C 5935
287266 D 1589
I want to condense the dataframe such that there is only one row for each 'Id' and all the 'names' and 'sub_ids' under each 'Id' appear as a single set on that row:
Id NAME SUB_ID
276956 set(A,B,C) set(5933,5934,5935)
287266 set(D) set(1589)
I tried to groupby 'Id' and then aggregate over all the other columns:
df.groupby('Id').agg(lambda x: set(x))
But in doing so the resulting dataframe does not have the 'Id' column. When you do groupby the 'Id' is returned as the first value of the tuple but I guess when you aggregate that is lost. Is there a way to get the dataframe that I am looking for: to groupby and aggregate without losing the column which was grouped.
Upvotes: 29
Views: 41607
Reputation: 32095
If you don't want the groupby as an index, there is an argument for it to avoid further reset:
df.groupby('Id', as_index=False).agg(lambda x: set(x))
Upvotes: 33
Reputation: 37930
The groupby column becomes the index. You can simply reset the index to get it back:
In [4]: df.groupby('Id').agg(lambda x: set(x)).reset_index()
Out[4]:
Id NAME SUB_ID
0 276956 {A, C, B} {5933, 5934, 5935}
1 287266 {D} {1589}
Upvotes: 15