Peter
Peter

Reputation: 782

Pandas goupby and aggregate as list does not work

Below a snippet of my Pandas DataFrame:

    case_id     start_time  end_time    segment_name                                    duration
0   0.0     2016-06-02  2016-06-02  A_Create Application|COMPLETE - A_Submitted|CO...   0
15  0.0     2016-06-02  2016-06-02  A_Submitted|COMPLETE - W_Handle leads|SCHEDULE      0
19  0.0     2016-06-02  2016-06-02  W_Handle leads|SCHEDULE - W_Handle leads|WITHDRAW   0
16  0.0     2016-06-02  2016-06-02  W_Handle leads|WITHDRAW - W_Complete applicati...   0
17  0.0     2016-06-02  2016-06-02  W_Complete application|SCHEDULE - A_Concept|CO...   0
18  0.0     2016-06-02  2016-06-06  A_Concept|COMPLETE - W_Complete application|START   4
24  0.0     2016-06-06  2016-06-06  W_Complete application|START - A_Accepted|COMP...   0
20  0.0     2016-06-06  2016-06-06  A_Accepted|COMPLETE - O_Create Offer|COMPLETE       0
21  0.0     2016-06-06  2016-06-06  O_Create Offer|COMPLETE - O_Created|COMPLETE        0
22  0.0     2016-06-06  2016-06-06  O_Created|COMPLETE - O_Sent (mail and online)|...   0

Now, I want to group this DataFrame using pandas GroupBy on columns segment_name, start_time, end_time and duration, and for each resulting group get a list of case_id's. So something like:

A_Submitted|COMPLETE - A_Accepted|COMPLETE    2016-06-02     2016-06-03    1    [caseid1, caseid45, ...]
O_Create_Offer|COMPLETE - O_Created|Complete  2016-03-05     2016-03-11    6    [caseid35, caseid101, caseid149]

However, everything that I try gives either an error or an empty output. For example:

df = df.groupby(['segment_name', 'start_time', 'end_time', 'duration'], axis=1).apply(list) gives

_

df = df.groupby(['segment_name', 'start_time', 'end_time', 'duration']).apply(list) gives

segment_name                                   start_time  end_time    duration
A_Accepted|COMPLETE - O_Create Offer|COMPLETE  2016-01-02  2016-01-02  0           [case_id, start_time, end_time, segment_name, ...
                                               2016-01-04  2016-01-04  0           [case_id, start_time, end_time, segment_name, ...
                                               2016-01-05  2016-01-05  0           [case_id, start_time, end_time, segment_name, ...
                                               2016-01-06  2016-01-06  0           [case_id, start_time, end_time, segment_name, ...
                                               2016-01-07  2016-01-07  0           [case_id, start_time, end_time, segment_name, ...
                                               2016-01-08  2016-01-08  0           [case_id, start_time, end_time, segment_name, ...
                                               2016-01-09  2016-01-09  0           [case_id, start_time, end_time, segment_name, ...
                                               2016-01-11  2016-01-11  0           [case_id, start_time, end_time, segment_name, ...
                                               2016-01-12  2016-01-12  0           [case_id, start_time, end_time, segment_name, ...
                                               2016-01-13  2016-01-13  0           [case_id, start_time, end_time, segment_name, ...
dtype: object

df = df.groupby(['segment_name', 'start_time', 'end_time', 'duration'], axis=1)['case_id'].apply(list) gives

Series([], Name: case_id, dtype: float64)

How to get the desired result?

Upvotes: 0

Views: 53

Answers (1)

Tom
Tom

Reputation: 8800

I think what you want is your last attempt, except without the axis=1. So:

df.groupby(['segment_name', 'start_time', 'end_time', 'duration'])['case_id'].apply(list)

Example:

import pandas as pd

cases = [0] * 10
start = ['01-01-2021'] * 10
end = ['01-02-2021'] * 10
segment = ['A'] * 5 + ['B'] * 5
duration = [3] * 10

df = pd.DataFrame({'case_id':cases,
                   'start_time': start,
                   'end_time': end,
                   'segment_name': segment,
                   'duration': duration})

print(df.groupby(['segment_name', 'start_time', 'end_time', 'duration'])['case_id'].apply(list))

output:

segment_name  start_time  end_time    duration
A             01-01-2021  01-02-2021  3           [0, 0, 0, 0, 0]
B             01-01-2021  01-02-2021  3           [0, 0, 0, 0, 0]
Name: case_id, dtype: object

You could instead apply set if you want only unique case ids.

Upvotes: 1

Related Questions