Groupby filter based on count, calculate duration, penultimate status

Question

I have a dataframe as shown below.

   ID   Status  Date        Cost
0   1   F       2017-06-22  500
1   1   M       2017-07-22  100
2   1   P       2017-10-22  100
3   1   F       2018-06-22  600
4   1   M       2018-08-22  150
5   1   P       2018-10-22  120
6   1   F       2019-03-22  750
7   2   M       2017-06-29  200
8   2   F       2017-09-29  600
9   2   F       2018-01-29  500
10  2   M       2018-03-29  100
11  2   P       2018-08-29  100
12  2   M       2018-10-29  100
13  2   F       2018-12-29  500
14  3   M       2017-03-20  300
15  3   F       2018-06-20  700
16  3   P       2018-08-20  100
17  3   M       2018-10-20  250
18  3   F       2018-11-20  100
19  3   P       2018-12-20  100
20  3   F       2019-03-20  600
21  3   M       2019-05-20  200
22  4   M       2017-08-10  800
23  4   F       2018-06-10  100
24  4   P       2018-08-10  120
25  4   F       2018-10-10  500
26  4   M       2019-01-10  200
27  4   F       2019-06-10  600
28  5   M       2018-10-10  200
29  5   F       2019-06-10  500
30  6   F       2019-06-10  600
31  7   M       2017-08-10  800
32  7   F       2018-06-10  100
33  7   P       2018-08-10  120
34  7   M       2019-01-10  200
35  7   F       2019-06-10  600

where

F = Failure M = Maintenance P = Planned

Step1 - Select the data of IDs which is having at least two status(F or M or P) before the last Failure

Step2 - Ignore the rows if the last raw per ID is not F, expected output after this as shown below.

        ID  Status  Date    Cost
    0   1   F   2017-06-22  500
    1   1   M   2017-07-22  100
    2   1   P   2017-10-22  100
    3   1   F   2018-06-22  600
    4   1   M   2018-08-22  150
    5   1   P   2018-10-22  120
    6   1   F   2019-03-22  750
    7   2   M   2017-06-29  200
    8   2   F   2017-09-29  600
    9   2   F   2018-01-29  500
    10  2   M   2018-03-29  100
    11  2   P   2018-08-29  100
    12  2   M   2018-10-29  100
    13  2   F   2018-12-29  500
    14  3   M   2017-03-20  300
    15  3   F   2018-06-20  700
    16  3   P   2018-08-20  100
    17  3   M   2018-10-20  250
    18  3   F   2018-11-20  100
    19  3   P   2018-12-20  100
    20  3   F   2019-03-20  600
    22  4   M   2017-08-10  800
    23  4   F   2018-06-10  100
    24  4   P   2018-08-10  120
    25  4   F   2018-10-10  500
    26  4   M   2019-01-10  200
    27  4   F   2019-06-10  600
    31  7   M   2017-08-10  800
    32  7   F   2018-06-10  100
    33  7   P   2018-08-10  120
    34  7   M   2019-01-10  200
    35  7   F   2019-06-10  600

Now, for each id last status is failure

Then from the above df I would like to prepare below Data frame

ID  No_of_F  No_of_M  No_of_P  SLS  NoDays_to_SLS NoDays_SLS_to_LS
1   3        2        2        P    487           151
2   3        3        2        M    487           61
3   3        2        2        P    640           90
4   3        1        1        M    518           151
7   2        1        1        M    518           151

SLS = Second Last Status

LS = Last Status

I tried the following code to calculate the duration.

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days

ALollz · Accepted Answer

We can create a mask with gropuby + bfill that allows us to perform both selections.

m = df.Status.eq('F').replace(False, np.NaN).groupby(df.ID).bfill()
df = df.loc[m.groupby(df.ID).transform('sum').gt(2) & m]

    ID Status        Date  Cost
0    1      F  2017-06-22   500
1    1      M  2017-07-22   100
2    1      P  2017-10-22   100
3    1      F  2018-06-22   600
4    1      M  2018-08-22   150
5    1      P  2018-10-22   120
6    1      F  2019-03-22   750
7    2      M  2017-06-29   200
8    2      F  2017-09-29   600
9    2      F  2018-01-29   500
10   2      M  2018-03-29   100
11   2      P  2018-08-29   100
12   2      M  2018-10-29   100
13   2      F  2018-12-29   500
14   3      M  2017-03-20   300
15   3      F  2018-06-20   700
16   3      P  2018-08-20   100
17   3      M  2018-10-20   250
18   3      F  2018-11-20   100
19   3      P  2018-12-20   100
20   3      F  2019-03-20   600
22   4      M  2017-08-10   800
23   4      F  2018-06-10   100
24   4      P  2018-08-10   120
25   4      F  2018-10-10   500
26   4      M  2019-01-10   200
27   4      F  2019-06-10   600
31   7      M  2017-08-10   800
32   7      F  2018-06-10   100
33   7      P  2018-08-10   120
34   7      M  2019-01-10   200
35   7      F  2019-06-10   600

The second part is a bit more annoying. There's almost certainly a smarter way to do this, but here's the straight forward way:

s = df.Date.diff().dt.days

res = pd.concat([df.groupby('ID').Status.value_counts().unstack().add_prefix('No_of_'),
                 df.groupby('ID').Status.apply(lambda x: x.iloc[-2]).to_frame('SLS'),
                 (s.where(s.gt(0)).groupby(df.ID).apply(lambda x: x.cumsum().iloc[-2])
                   .to_frame('NoDays_to_SLS')),
                 s.groupby(df.ID).apply(lambda x: x.iloc[-1]).to_frame('NoDays_SLS_to_LS')],
                axis=1)

Output:

    No_of_F  No_of_M  No_of_P SLS  NoDays_to_SLS  NoDays_SLS_to_LS
ID                                                                
1         3        2        2   P          487.0             151.0
2         3        3        1   M          487.0              61.0
3         3        2        2   P          640.0              90.0
4         3        2        1   M          518.0             151.0
7         2        2        1   M          518.0             151.0

Groupby filter based on count, calculate duration, penultimate status

Answers (2)

Output:

Related Questions