Reputation: 782
Take the following example pandas DataFrame:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
"start": ["jan1", "jan1", "jan4", "feb17", "jan4", "mar3"],
"end": ["jan3", "jan3", "jan21", "feb17", "jan21", "mar4"],
"duration": [2, 2, 17, 0, 17, 1],
"case_id": ["case1", "case43", "case6", "case1", "case22", "case69"]
})
I want to use a pandas groupby operation on columns start
, end
and duration
to perform two list aggregations on the dataframe:
id
values for each groupcase_id
values for each groupMy desired output would look like this:
start end duration ids cases
jan1 jan3 2 [1, 2] [case1, case43]
jan4 jan21 17 [3, 5] [case6, case22]
feb17 feb17 0 [4] [case1]
mar3 mar4 1 [6] [case69]
How to do this efficiently using pandas groupby?
I know that if I would need only one aggregation I could do it like this:
df = df.groupby(['start', 'end', 'duration'])['id'].apply(list).to_frame()
How to do this for multiple list aggregations? And if there are multiple options, what would be the least time consuming? (the DataFrames I'm transforming are quite large)
Upvotes: 1
Views: 705
Reputation: 13831
You will need to use pandas.groupby.agg
, and specify the columns you want to return as list
.
To lessen the time needed, since you have categorical columns in your data, make sure you use the observed=True
option in your groupby command. This makes sure it only creates lines where an entry is present (more information on this here)
res = df.groupby(['start', 'end', 'duration'],observed=True)[['id','case_id']].agg(list).reset_index().sort_values(by='id')
Output:
res
Out[164]:
start end duration id case_id
1 jan1 jan3 2 [1, 2] [case1, case43]
2 jan4 jan21 17 [3, 5] [case6, case22]
0 feb17 feb17 0 [4] [case1]
3 mar3 mar4 1 [6] [case69]
Assuming that your unique categories are not too many and your dataset is not excessively large, this shouldn't be a problem. Generally, processing strings takes a lot longer than processing numbers, so if this takes too long to run, you could try converting your object
columns to numeric columns and re-doing your groupby
.
Upvotes: 1