Reputation: 2871
I have Data frame as shown below.
ID Status Date Cost Duration
0 1 F 2017-06-22 500 nan
1 1 M 2017-07-22 100 30.00
2 1 P 2017-10-22 100 92.00
3 1 F 2018-06-22 600 243.00
4 1 M 2018-08-22 150 61.00
5 1 P 2018-10-22 120 61.00
6 1 F 2019-03-22 750 151.00
7 2 M 2017-06-29 200 nan
8 2 F 2017-09-29 600 92.00
9 2 F 2018-01-29 500 122.00
10 2 M 2018-03-29 100 59.00
11 2 P 2018-08-29 100 153.00
12 2 M 2018-10-29 100 61.00
13 2 F 2018-12-29 500 61.00
14 3 M 2017-03-20 300 nan
15 3 F 2018-06-20 700 457.00
16 3 P 2018-08-20 100 61.00
17 3 M 2018-10-20 250 61.00
18 3 F 2018-11-20 100 31.00
19 3 P 2018-12-20 100 30.00
20 3 F 2019-03-20 600 90.00
22 4 M 2017-08-10 800 nan
23 4 F 2018-06-10 100 304.00
24 4 P 2018-08-10 120 61.00
25 4 F 2018-10-10 500 61.00
26 4 M 2019-01-10 200 92.00
27 4 F 2019-06-10 600 151.00
The data frame is already sorted based on ID and Date
Duration = Number days taken from last Status to current Status for that ID.
I want prepare below Data frame using pandas functions.
ID Nof_F-F Nof_F-M Nof_F-P Nof_M-F Nof_M-M Nof_M-P Nof_P-F Nof_P-M
1 0 2 0 0 0 2 2 0
2 1 0 0 2 0 1 0 1
3 0 0 2 2 0 0 1 1
4 0 1 1 2 0 0 1 0
where Nof_F-F = Number of consecutive F-F and so on.
Also I would like to prepare following data frame also
ID Avg_F-F Avg_F-M Avg_F-P Avg_M-F Avg_M-M Avg_M-P Avg_P-F Avg_P-M
1 nan 45.5 nan nan nan 76.5 197 nan
2 122 nan nan 76.5 nan 153 nan 61
3 nan nan 45.5 244 nan nan 90 61
4 nan 92 61 227.5 nan nan 61 nan
Avg_F-F = Average days for Consecutive F-F and similarly for all other columns.
Upvotes: 1
Views: 66
Reputation: 59579
Since already sorted, use shift
+where
to create the links. Then groupby
+ unstack
df['link'] = df.Status.shift().where(df.ID.eq(df.ID.shift())) + '-' + df.Status
df.groupby(['ID', 'link']).agg('size').unstack().fillna(0)
#link F-F F-M F-P M-F M-P P-F P-M
#ID
#1 0.0 2.0 0.0 0.0 2.0 2.0 0.0
#2 1.0 1.0 0.0 2.0 1.0 0.0 1.0
#3 0.0 0.0 2.0 2.0 0.0 1.0 1.0
#4 0.0 1.0 1.0 2.0 0.0 1.0 0.0
df.groupby(['ID', 'link']).Duration.mean().unstack()
#link F-F F-M F-P M-F M-P P-F P-M
#ID
#1 NaN 45.5 NaN NaN 76.5 197.0 NaN
#2 122.0 59.0 NaN 76.5 153.0 NaN 61.0
#3 NaN NaN 45.5 244.0 NaN 90.0 61.0
#4 NaN 92.0 61.0 227.5 NaN 61.0 NaN
To ensure all links exist in the output use reindex
after unstacking. Either form the list yourself or use unique
to get all values that exist.
u = df.Status.unique()
cols = [f'{x}-{y}' for x in u for y in u]
#['F-F', 'F-M', 'F-P', 'M-F', 'M-M', 'M-P', 'P-F', 'P-M', 'P-P']
df.groupby(['ID', 'link']).Duration.mean().unstack().reindex(cols, axis=1)
#link F-F F-M F-P M-F M-M M-P P-F P-M P-P
#ID
#1 NaN 45.5 NaN NaN NaN 76.5 197.0 NaN NaN
#2 122.0 59.0 NaN 76.5 NaN 153.0 NaN 61.0 NaN
#3 NaN NaN 45.5 244.0 NaN NaN 90.0 61.0 NaN
#4 NaN 92.0 61.0 227.5 NaN NaN 61.0 NaN NaN
Upvotes: 2