Riva Dan
Riva Dan

Reputation: 47

How to aggregate an accumulative list in Pandas dataframe?

I have a Pandas dataframe like this one below.

    +--------+-------------------+
    | worker |  day  |    tasks  |
    +--------+-------------------+
    |   A    |   2   | 'read'    |
    |   A    |   9   | 'write'   |
    |   B    |   1   | 'read'    |
    |   B    |   2   | 'write'   |
    |   B    |   4   | 'execute' |
    +--------+-------------------+

I need to group dataframe by [col_1, day] and gain an acummulated list of tasks.
e.g. if worker 'A' has the task 'read' in day 2, then the tasks for day 9 should be ['read', 'write'] list.


The desired dataframe looks like this:

    +--------+-------------------------------------+
    | worker |  day  |          aggregation        |
    +--------+-------------------------------------+
    |   A    |   2   | ['read']                    |
    |   A    |   9   | ['read', 'write']           |
    |   B    |   1   | ['read']                    |
    |   B    |   2   | ['read', 'write']           |
    |   B    |   4   | ['read', 'write', 'execute']|
    +--------+-------------------------------------+

I tried to use cumsum

df = df.groupby(['worker', 'day'])['tasks'].apply(list).reset_index(name='aggregation').cumsum()

But it aggregates an accumulation for all the columns and not inside the groups for certain column.

Upvotes: 2

Views: 138

Answers (1)

cs95
cs95

Reputation: 402814

One straightforward method (although not ideal in terms of performance — then again you're storing lists in columns so your mileage will vary) is to do a groupby and cumsum with lists.

df['tasks'].map(lambda x: [x]).groupby(df['worker']).apply(pd.Series.cumsum) 

0                    [read]
1             [read, write]
2                    [read]
3             [read, write]
4    [read, write, execute]
Name: tasks, dtype: object

Or, very similarly,

(df.assign(tasks=df['tasks'].map(lambda x: [x]))
   .groupby('worker')['tasks']
   .apply(pd.Series.cumsum))

0                    [read]
1             [read, write]
2                    [read]
3             [read, write]
4    [read, write, execute]
Name: tasks, dtype: object

Since you're cumsumming objects (and not numeric data), we need to apply pd.Series.cumsum instead of calling the cythonized GroupBy.cumsum which chokes on python lists.

Upvotes: 3

Related Questions