Danish
Danish

Reputation: 2871

Groupby sequence count and duration of sequence

I have Data frame as shown below.

   ID   Status      Date    Cost    Duration
0   1   F       2017-06-22  500     nan
1   1   M       2017-07-22  100     30.00
2   1   P       2017-10-22  100     92.00
3   1   F       2018-06-22  600     243.00
4   1   M       2018-08-22  150     61.00
5   1   P       2018-10-22  120     61.00
6   1   F       2019-03-22  750     151.00
7   2   M       2017-06-29  200     nan
8   2   F       2017-09-29  600     92.00
9   2   F       2018-01-29  500     122.00
10  2   M       2018-03-29  100     59.00
11  2   P       2018-08-29  100     153.00
12  2   M       2018-10-29  100     61.00
13  2   F       2018-12-29  500     61.00
14  3   M       2017-03-20  300     nan
15  3   F       2018-06-20  700     457.00
16  3   P       2018-08-20  100     61.00
17  3   M       2018-10-20  250     61.00
18  3   F       2018-11-20  100     31.00
19  3   P       2018-12-20  100     30.00
20  3   F       2019-03-20  600     90.00
22  4   M       2017-08-10  800     nan
23  4   F       2018-06-10  100     304.00
24  4   P       2018-08-10  120     61.00
25  4   F       2018-10-10  500     61.00
26  4   M       2019-01-10  200     92.00
27  4   F       2019-06-10  600     151.00

The data frame is already sorted based on ID and Date

Duration = Number days taken from last Status to current Status for that ID.

I want prepare below Data frame using pandas functions.

ID  Nof_F-F  Nof_F-M  Nof_F-P Nof_M-F Nof_M-M  Nof_M-P Nof_P-F Nof_P-M
1   0        2        0       0       0        2       2       0
2   1        0        0       2       0        1       0       1
3   0        0        2       2       0        0       1       1
4   0        1        1       2       0        0       1       0

where Nof_F-F = Number of consecutive F-F and so on.

Also I would like to prepare following data frame also

    ID  Avg_F-F  Avg_F-M  Avg_F-P Avg_M-F Avg_M-M  Avg_M-P Avg_P-F Avg_P-M
    1   nan      45.5     nan     nan     nan      76.5    197     nan
    2   122      nan      nan     76.5    nan      153     nan     61
    3   nan      nan      45.5    244     nan      nan     90      61
    4   nan      92       61      227.5   nan      nan     61      nan

Avg_F-F = Average days for Consecutive F-F and similarly for all other columns.

Upvotes: 1

Views: 66

Answers (1)

ALollz
ALollz

Reputation: 59579

Since already sorted, use shift +where to create the links. Then groupby + unstack

df['link'] = df.Status.shift().where(df.ID.eq(df.ID.shift())) + '-' + df.Status

df.groupby(['ID', 'link']).agg('size').unstack().fillna(0)
#link  F-F  F-M  F-P  M-F  M-P  P-F  P-M
#ID                                     
#1     0.0  2.0  0.0  0.0  2.0  2.0  0.0
#2     1.0  1.0  0.0  2.0  1.0  0.0  1.0
#3     0.0  0.0  2.0  2.0  0.0  1.0  1.0
#4     0.0  1.0  1.0  2.0  0.0  1.0  0.0

df.groupby(['ID', 'link']).Duration.mean().unstack()
#link    F-F   F-M   F-P    M-F    M-P    P-F   P-M
#ID                                                
#1       NaN  45.5   NaN    NaN   76.5  197.0   NaN
#2     122.0  59.0   NaN   76.5  153.0    NaN  61.0
#3       NaN   NaN  45.5  244.0    NaN   90.0  61.0
#4       NaN  92.0  61.0  227.5    NaN   61.0   NaN

To ensure all links exist in the output use reindex after unstacking. Either form the list yourself or use unique to get all values that exist.

u = df.Status.unique()
cols = [f'{x}-{y}' for x in u for y in u]
#['F-F', 'F-M', 'F-P', 'M-F', 'M-M', 'M-P', 'P-F', 'P-M', 'P-P']

df.groupby(['ID', 'link']).Duration.mean().unstack().reindex(cols, axis=1)

#link    F-F   F-M   F-P    M-F  M-M    M-P    P-F   P-M  P-P
#ID                                                          
#1       NaN  45.5   NaN    NaN  NaN   76.5  197.0   NaN  NaN
#2     122.0  59.0   NaN   76.5  NaN  153.0    NaN  61.0  NaN
#3       NaN   NaN  45.5  244.0  NaN    NaN   90.0  61.0  NaN
#4       NaN  92.0  61.0  227.5  NaN    NaN   61.0   NaN  NaN

Upvotes: 2

Related Questions