Reputation: 47
I have a Pandas dataframe like this one below.
+--------+-----+-----------+
| worker | day | tasks     |
+--------+-----+-----------+
| A      |  2  | 'read'    |
| A      |  9  | 'write'   |
| B      |  1  | 'read'    |
| B      |  2  | 'write'   |
| B      |  4  | 'execute' |
+--------+-----+-----------+
I need to group the dataframe by ['worker', 'day'] and get an accumulated list of tasks.
For example, if worker 'A' has the task 'read' on day 2, then the tasks for day 9 should be
the list ['read', 'write'].
The desired dataframe looks like this:
+--------+-----+------------------------------+
| worker | day | aggregation                  |
+--------+-----+------------------------------+
| A      |  2  | ['read']                     |
| A      |  9  | ['read', 'write']            |
| B      |  1  | ['read']                     |
| B      |  2  | ['read', 'write']            |
| B      |  4  | ['read', 'write', 'execute'] |
+--------+-----+------------------------------+
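For reference, the input frame above can be reconstructed like this (dtypes assumed from the table):

```python
import pandas as pd

# Rebuild the example frame from the table in the question.
df = pd.DataFrame({
    'worker': ['A', 'A', 'B', 'B', 'B'],
    'day': [2, 9, 1, 2, 4],
    'tasks': ['read', 'write', 'read', 'write', 'execute'],
})
print(df)
```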
I tried to use cumsum:
df = df.groupby(['worker', 'day'])['tasks'].apply(list).reset_index(name='aggregation').cumsum()
But it accumulates over the entire column, not within each worker's group.
Upvotes: 2
Views: 138
Reputation: 402814
One straightforward method (although not ideal in terms of performance; then again, you're storing lists in columns, so your mileage will vary) is to do a groupby and cumsum with lists.
df['tasks'].map(lambda x: [x]).groupby(df['worker']).apply(pd.Series.cumsum)
0 [read]
1 [read, write]
2 [read]
3 [read, write]
4 [read, write, execute]
Name: tasks, dtype: object
Or, very similarly,
(df.assign(tasks=df['tasks'].map(lambda x: [x]))
.groupby('worker')['tasks']
.apply(pd.Series.cumsum))
0 [read]
1 [read, write]
2 [read]
3 [read, write]
4 [read, write, execute]
Name: tasks, dtype: object
Since you're cumsumming objects (not numeric data), you need to apply pd.Series.cumsum instead of calling the cythonized GroupBy.cumsum, which chokes on Python lists.
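For completeness, here is a sketch of wiring the result back into a frame shaped like your desired output (the `agg` name is mine; `group_keys=False` keeps the original row index on newer pandas versions, where apply otherwise prepends the group keys):

```python
import pandas as pd

df = pd.DataFrame({
    'worker': ['A', 'A', 'B', 'B', 'B'],
    'day': [2, 9, 1, 2, 4],
    'tasks': ['read', 'write', 'read', 'write', 'execute'],
})

# Wrap each task in a singleton list, then concatenate cumulatively per worker.
agg = (df.assign(tasks=df['tasks'].map(lambda x: [x]))
         .groupby('worker', group_keys=False)['tasks']
         .apply(pd.Series.cumsum))

# Align the cumulative lists back onto the original rows by index.
result = df[['worker', 'day']].assign(aggregation=agg)
print(result)
```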
Upvotes: 3