How to aggregate an accumulative list in Pandas dataframe?

Question

I have a Pandas dataframe like this one below.

    +--------+-------------------+
    | worker |  day  |    tasks  |
    +--------+-------------------+
    |   A    |   2   | 'read'    |
    |   A    |   9   | 'write'   |
    |   B    |   1   | 'read'    |
    |   B    |   2   | 'write'   |
    |   B    |   4   | 'execute' |
    +--------+-------------------+

I need to group dataframe by [col_1, day] and gain an acummulated list of tasks.
e.g. if worker 'A' has the task 'read' in day 2, then the tasks for day 9 should be ['read', 'write'] list.

The desired dataframe looks like this:

    +--------+-------------------------------------+
    | worker |  day  |          aggregation        |
    +--------+-------------------------------------+
    |   A    |   2   | ['read']                    |
    |   A    |   9   | ['read', 'write']           |
    |   B    |   1   | ['read']                    |
    |   B    |   2   | ['read', 'write']           |
    |   B    |   4   | ['read', 'write', 'execute']|
    +--------+-------------------------------------+

I tried to use cumsum

df = df.groupby(['worker', 'day'])['tasks'].apply(list).reset_index(name='aggregation').cumsum()

But it aggregates an accumulation for all the columns and not inside the groups for certain column.

cs95 · Accepted Answer

One straightforward method (although not ideal in terms of performance — then again you're storing lists in columns so your mileage will vary) is to do a groupby and cumsum with lists.

df['tasks'].map(lambda x: [x]).groupby(df['worker']).apply(pd.Series.cumsum) 

0                    [read]
1             [read, write]
2                    [read]
3             [read, write]
4    [read, write, execute]
Name: tasks, dtype: object

Or, very similarly,

(df.assign(tasks=df['tasks'].map(lambda x: [x]))
   .groupby('worker')['tasks']
   .apply(pd.Series.cumsum))

0                    [read]
1             [read, write]
2                    [read]
3             [read, write]
4    [read, write, execute]
Name: tasks, dtype: object

Since you're cumsumming objects (and not numeric data), we need to apply pd.Series.cumsum instead of calling the cythonized GroupBy.cumsum which chokes on python lists.

How to aggregate an accumulative list in Pandas dataframe?

Answers (1)

Related Questions