Danish
Danish

Reputation: 2871

Groupby filter, based on consecutive sequence sorted and ID and Date column

I have a dataframe as shown below

    ID  Status  Date    
0   1   F   2017-06-22  
1   1   M   2017-07-22  
2   1   P   2017-10-22  
3   1   F   2018-06-22 
4   1   P   2018-08-22  
5   1   F   2018-10-22  
6   1   F   2019-03-22  
7   2   M   2017-06-29 
8   2   F   2017-09-29 
9   2   F   2018-01-29  
10  2   M   2018-03-29 
11  2   P   2018-08-29  
12  2   M   2018-10-29  
13  2   F   2018-12-29  
14  3   M   2017-03-20  
15  3   F   2018-06-20  
16  3   P   2018-08-20  
17  3   M   2018-10-20  
18  3   F   2018-11-20  
19  3   P   2018-12-20  
20  3   F   2019-03-20  
22  4   M   2017-08-10  
23  4   F   2018-06-10  
24  4   P   2018-08-10  
25  4   F   2018-12-10  
26  4   M   2019-01-10  
27  4   F   2019-06-10  
31  7   M   2017-08-10  
32  7   F   2018-04-10  
33  7   P   2018-08-10  
34  7   F   2018-11-10  
33  7   P   2019-08-10  
34  7   F   2019-10-10   

I would like to filter above data frame into consecutive F-P-F or F-P-F-P-F and so on for each IDs.

After the above step, now our dataframe is as shown below.

    ID  Status  Date      
3   1   F   2018-06-22 
4   1   P   2018-08-22  
5   1   F   2018-10-22   
18  3   F   2018-11-20  
19  3   P   2018-12-20  
20  3   F   2019-03-20   
23  4   F   2018-06-10  
24  4   P   2018-08-10  
25  4   F   2018-12-10    
32  7   F   2018-04-10  
33  7   P   2018-08-10  
34  7   F   2018-11-10
35  7   P   2019-08-10  
36  7   F   2019-10-10   

Then from the above Data frame calculated the Duration column as shown below.

df['Duration'] = df.groupby('ID')['Date'].diff().dt.days

    ID  Status  Date    Duration
0   1   F   2018-06-22  nan
1   1   P   2018-08-22  61.00
2   1   F   2018-10-22  61.00
3   3   F   2018-11-20  nan
4   3   P   2018-12-20  30.00
5   3   F   2019-03-20  90.00
6   4   F   2018-06-10  nan
7   4   P   2018-08-10  61.00
8   4   F   2018-12-10  122.00
9   7   F   2018-04-10  nan
10  7   P   2018-08-10  122.00
11  7   F   2018-11-10  92.00
12  7   P   2019-08-10  273.00
13  7   F   2019-10-10  61.00

From the above DF I want to prepare below data frame.

So the Final expected output is shown below

ID   F-P_Duration  F-F_Duration  P-F_Duration
1    61            122           61
3    30            120           30
4    61            183           122
7_1  122           214           92
7_2  273           334           61

Upvotes: 0

Views: 105

Answers (2)

Scott Boston
Scott Boston

Reputation: 153500

Try this:

import re
import pandas as pd
import numpy as np

df = pd.read_clipboard()
df['Date'] = pd.to_datetime(df['Date'])
print(df)

Output:

    ID Status       Date
0    1      F 2017-06-22
1    1      M 2017-07-22
2    1      P 2017-10-22
3    1      F 2018-06-22
4    1      P 2018-08-22
5    1      F 2018-10-22
6    1      F 2019-03-22
7    2      M 2017-06-29
8    2      F 2017-09-29
9    2      F 2018-01-29
10   2      M 2018-03-29
11   2      P 2018-08-29
12   2      M 2018-10-29
13   2      F 2018-12-29
14   3      M 2017-03-20
15   3      F 2018-06-20
16   3      P 2018-08-20
17   3      M 2018-10-20
18   3      F 2018-11-20
19   3      P 2018-12-20
20   3      F 2019-03-20
22   4      M 2017-08-10
23   4      F 2018-06-10
24   4      P 2018-08-10
25   4      F 2018-12-10
26   4      M 2019-01-10
27   4      F 2019-06-10
31   7      M 2017-08-10
32   7      F 2018-04-10
33   7      P 2018-08-10
34   7      F 2018-11-10
33   7      P 2019-08-10
34   7      F 2019-10-10

My first trick is to use regular expressions to find the location of a substring in the longer string. Using join, I build a string and look for the pattern in that string. Define pattern by putting the longer pattern first.

pattern = "FPFPF|FPF"
def f(x):
    m = re.search(pattern, ''.join(x['Status']))
    return x[m.start():m.end()] if m else None
df1 = df.groupby('ID', group_keys=False).apply(f)
print(df1)

Output:

    ID Status       Date
3    1      F 2018-06-22
4    1      P 2018-08-22
5    1      F 2018-10-22
18   3      F 2018-11-20
19   3      P 2018-12-20
20   3      F 2019-03-20
23   4      F 2018-06-10
24   4      P 2018-08-10
25   4      F 2018-12-10
32   7      F 2018-04-10
33   7      P 2018-08-10
34   7      F 2018-11-10
33   7      P 2019-08-10
34   7      F 2019-10-10

Calculate duration

df1['Duration'] = df1.groupby('ID')['Date'].diff().dt.days
print(df1)

Output:

    ID Status       Date  Duration
3    1      F 2018-06-22       NaN
4    1      P 2018-08-22      61.0
5    1      F 2018-10-22      61.0
18   3      F 2018-11-20       NaN
19   3      P 2018-12-20      30.0
20   3      F 2019-03-20      90.0
23   4      F 2018-06-10       NaN
24   4      P 2018-08-10      61.0
25   4      F 2018-12-10     122.0
32   7      F 2018-04-10       NaN
33   7      P 2018-08-10     122.0
34   7      F 2018-11-10      92.0
33   7      P 2019-08-10     273.0
34   7      F 2019-10-10      61.0

Aggregate during using the latest pandas 0.25 with aggregation relabeling:

df_out = df1.groupby(['ID',(df1['Status'] != 'F').cumsum()])['Duration']\
            .agg(F_P_Duration = lambda x: x.iloc[0], 
                 F_F_Duration = 'sum').dropna()
print(df_out)

Output:

           F_P_Duration  F_F_Duration
ID Status                            
1  1               61.0         122.0
3  2               30.0         120.0
4  3               61.0         183.0
7  4              122.0         214.0
   5              273.0         334.0

Update for prior to Pandas 0.25...

df1.groupby(['ID',(df1['Status'] != 'F').cumsum()])['Duration']\
            .agg(['first', 'sum', 'last']).dropna()\
            .rename(columns={'first':'F_P_Duration',
                             'sum':'F_F_Duration',
                             'last':'P_F_Duration'})

Output:

           F_P_Duration  F_F_Duration  P_F_Duration
ID Status                                          
1  1               61.0         122.0          61.0
3  2               30.0         120.0          90.0
4  3               61.0         183.0         122.0
7  4              122.0         214.0          92.0
   5              273.0         334.0          61.0

Upvotes: 1

moys
moys

Reputation: 8033

Try if this works for you

mask=df['Status']=='M' 
dff=df.loc[~mask, :].copy()
dff['a']= dff.groupby(["ID"])['Status'].shift()
df2=dff.loc[dff['Status'] != dff['a']]
df2.drop('a' , axis=1, inplace=True)

or just this would give the same result as above

mask=df['Status']=='M'
dff=df.loc[~mask, :].copy()
dff= dff.loc[dff.groupby(["ID"])['Status'].shift()!= dff['Status']]

Upvotes: 0

Related Questions