Reputation: 2871
I have a dataframe as shown below
ID Status Date
0 1 F 2017-06-22
1 1 M 2017-07-22
2 1 P 2017-10-22
3 1 F 2018-06-22
4 1 P 2018-08-22
5 1 F 2018-10-22
6 1 F 2019-03-22
7 2 M 2017-06-29
8 2 F 2017-09-29
9 2 F 2018-01-29
10 2 M 2018-03-29
11 2 P 2018-08-29
12 2 M 2018-10-29
13 2 F 2018-12-29
14 3 M 2017-03-20
15 3 F 2018-06-20
16 3 P 2018-08-20
17 3 M 2018-10-20
18 3 F 2018-11-20
19 3 P 2018-12-20
20 3 F 2019-03-20
22 4 M 2017-08-10
23 4 F 2018-06-10
24 4 P 2018-08-10
25 4 F 2018-12-10
26 4 M 2019-01-10
27 4 F 2019-06-10
31 7 M 2017-08-10
32 7 F 2018-04-10
33 7 P 2018-08-10
34 7 F 2018-11-10
33 7 P 2019-08-10
34 7 F 2019-10-10
I would like to filter above data frame into consecutive F-P-F or F-P-F-P-F and so on for each IDs.
After the above step, now our dataframe is as shown below.
ID Status Date
3 1 F 2018-06-22
4 1 P 2018-08-22
5 1 F 2018-10-22
18 3 F 2018-11-20
19 3 P 2018-12-20
20 3 F 2019-03-20
23 4 F 2018-06-10
24 4 P 2018-08-10
25 4 F 2018-12-10
32 7 F 2018-04-10
33 7 P 2018-08-10
34 7 F 2018-11-10
35 7 P 2019-08-10
36 7 F 2019-10-10
Then from the above Data frame calculated the Duration column as shown below.
df['Duration'] = df.groupby('ID')['Date'].diff().dt.days
ID Status Date Duration
0 1 F 2018-06-22 nan
1 1 P 2018-08-22 61.00
2 1 F 2018-10-22 61.00
3 3 F 2018-11-20 nan
4 3 P 2018-12-20 30.00
5 3 F 2019-03-20 90.00
6 4 F 2018-06-10 nan
7 4 P 2018-08-10 61.00
8 4 F 2018-12-10 122.00
9 7 F 2018-04-10 nan
10 7 P 2018-08-10 122.00
11 7 F 2018-11-10 92.00
12 7 P 2019-08-10 273.00
13 7 F 2019-10-10 61.00
From the above DF I want to prepare below data frame.
So the Final expected output is shown below
ID F-P_Duration F-F_Duration P-F_Duration
1 61 122 61
3 30 120 30
4 61 183 122
7_1 122 214 92
7_2 273 334 61
Upvotes: 0
Views: 105
Reputation: 153500
Try this:
import re
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df['Date'] = pd.to_datetime(df['Date'])
print(df)
Output:
ID Status Date
0 1 F 2017-06-22
1 1 M 2017-07-22
2 1 P 2017-10-22
3 1 F 2018-06-22
4 1 P 2018-08-22
5 1 F 2018-10-22
6 1 F 2019-03-22
7 2 M 2017-06-29
8 2 F 2017-09-29
9 2 F 2018-01-29
10 2 M 2018-03-29
11 2 P 2018-08-29
12 2 M 2018-10-29
13 2 F 2018-12-29
14 3 M 2017-03-20
15 3 F 2018-06-20
16 3 P 2018-08-20
17 3 M 2018-10-20
18 3 F 2018-11-20
19 3 P 2018-12-20
20 3 F 2019-03-20
22 4 M 2017-08-10
23 4 F 2018-06-10
24 4 P 2018-08-10
25 4 F 2018-12-10
26 4 M 2019-01-10
27 4 F 2019-06-10
31 7 M 2017-08-10
32 7 F 2018-04-10
33 7 P 2018-08-10
34 7 F 2018-11-10
33 7 P 2019-08-10
34 7 F 2019-10-10
My first trick is to use regular expressions to find the location of a substring in the longer string.
Using join
, I build a string and look for the pattern in that string. Define pattern by putting the longer pattern first.
pattern = "FPFPF|FPF"
def f(x):
m = re.search(pattern, ''.join(x['Status']))
return x[m.start():m.end()] if m else None
df1 = df.groupby('ID', group_keys=False).apply(f)
print(df1)
Output:
ID Status Date
3 1 F 2018-06-22
4 1 P 2018-08-22
5 1 F 2018-10-22
18 3 F 2018-11-20
19 3 P 2018-12-20
20 3 F 2019-03-20
23 4 F 2018-06-10
24 4 P 2018-08-10
25 4 F 2018-12-10
32 7 F 2018-04-10
33 7 P 2018-08-10
34 7 F 2018-11-10
33 7 P 2019-08-10
34 7 F 2019-10-10
Calculate duration
df1['Duration'] = df1.groupby('ID')['Date'].diff().dt.days
print(df1)
Output:
ID Status Date Duration
3 1 F 2018-06-22 NaN
4 1 P 2018-08-22 61.0
5 1 F 2018-10-22 61.0
18 3 F 2018-11-20 NaN
19 3 P 2018-12-20 30.0
20 3 F 2019-03-20 90.0
23 4 F 2018-06-10 NaN
24 4 P 2018-08-10 61.0
25 4 F 2018-12-10 122.0
32 7 F 2018-04-10 NaN
33 7 P 2018-08-10 122.0
34 7 F 2018-11-10 92.0
33 7 P 2019-08-10 273.0
34 7 F 2019-10-10 61.0
Aggregate during using the latest pandas 0.25 with aggregation relabeling:
df_out = df1.groupby(['ID',(df1['Status'] != 'F').cumsum()])['Duration']\
.agg(F_P_Duration = lambda x: x.iloc[0],
F_F_Duration = 'sum').dropna()
print(df_out)
Output:
F_P_Duration F_F_Duration
ID Status
1 1 61.0 122.0
3 2 30.0 120.0
4 3 61.0 183.0
7 4 122.0 214.0
5 273.0 334.0
Update for prior to Pandas 0.25...
df1.groupby(['ID',(df1['Status'] != 'F').cumsum()])['Duration']\
.agg(['first', 'sum', 'last']).dropna()\
.rename(columns={'first':'F_P_Duration',
'sum':'F_F_Duration',
'last':'P_F_Duration'})
Output:
F_P_Duration F_F_Duration P_F_Duration
ID Status
1 1 61.0 122.0 61.0
3 2 30.0 120.0 90.0
4 3 61.0 183.0 122.0
7 4 122.0 214.0 92.0
5 273.0 334.0 61.0
Upvotes: 1
Reputation: 8033
Try if this works for you
mask=df['Status']=='M'
dff=df.loc[~mask, :].copy()
dff['a']= dff.groupby(["ID"])['Status'].shift()
df2=dff.loc[dff['Status'] != dff['a']]
df2.drop('a' , axis=1, inplace=True)
or just this would give the same result as above
mask=df['Status']=='M'
dff=df.loc[~mask, :].copy()
dff= dff.loc[dff.groupby(["ID"])['Status'].shift()!= dff['Status']]
Upvotes: 0