Grouby filter, count each category, duration and identify second last based on date

Question

I have a data frame as shown below.

   ID   Status  Date        Cost
0   1   F       2017-06-22  500
1   1   M       2017-07-22  100
2   1   P       2017-10-22  100
3   1   F       2018-06-22  600
4   1   M       2018-08-22  150
5   1   P       2018-10-22  120
6   1   F       2019-03-22  750
7   2   M       2017-06-29  200
8   2   F       2017-09-29  600
9   2   F       2018-01-29  500
10  2   M       2018-03-29  100
11  2   P       2018-08-29  100
12  2   M       2018-10-29  100
13  2   F       2018-12-29  500
14  3   M       2017-03-20  300
15  3   F       2018-06-20  700
16  3   P       2018-08-20  100
17  3   M       2018-10-20  250
18  3   F       2018-11-20  100
19  3   P       2018-12-20  100
20  3   F       2019-03-20  600
21  3   M       2019-05-20  200
22  4   M       2017-08-10  800
23  4   F       2018-06-10  100
24  4   P       2018-08-10  120
25  4   F       2018-10-10  500
26  4   M       2019-01-10  200
27  4   F       2019-06-10  600

where

F=Failure,

M=Maintenance

P = Planned

Step1 - Ignore the rows if the last raw per ID is not F, expected out put after this as shown below.

ID  Status  Date        Cost
0   1   F   2017-06-22  500
1   1   M   2017-07-22  100
2   1   P   2017-10-22  100
3   1   F   2018-06-22  600
4   1   M   2018-08-22  150
5   1   P   2018-10-22  120
6   1   F   2019-03-22  750
7   2   M   2017-06-29  200
8   2   F   2017-09-29  600
9   2   F   2018-01-29  500
10  2   M   2018-03-29  100
11  2   P   2018-08-29  100
12  2   M   2018-10-29  100
13  2   F   2018-12-29  500
14  3   M   2017-03-20  300
15  3   F   2018-06-20  700
16  3   P   2018-08-20  100
17  3   M   2018-10-20  250
18  3   F   2018-11-20  100
19  3   P   2018-12-20  100
20  3   F   2019-03-20  600
22  4   M   2017-08-10  800
23  4   F   2018-06-10  100
24  4   P   2018-08-10  120
25  4   F   2018-10-10  500
26  4   M   2019-01-10  200
27  4   F   2019-06-10  600

Now, for each id last status is failure

Then from the above df I would like to prepare below Data frame

ID  No_of_F  No_of_M  No_of_P  SLS  NoDays_to_SLS NoDays_SLS_to_LS
1   3        2        2        P    487           151
2   3        3        2        M    487           61
3   3        2        2        P    640           90
4   3        1        1        M    518           151

SLS = Second Last Status

LS = Last Status

I tried following code to calculate the duration.

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days


ID  Status  Date    Cost    D
0   1   F   2017-06-22  500 nan
1   1   M   2017-07-22  100 30.00
2   1   P   2017-10-22  100 92.00
3   1   F   2018-06-22  600 243.00
4   1   M   2018-08-22  150 61.00
5   1   P   2018-10-22  120 61.00
6   1   F   2019-03-22  750 151.00
7   2   M   2017-06-29  200 nan
8   2   F   2017-09-29  600 92.00
9   2   F   2018-01-29  500 122.00
10  2   M   2018-03-29  100 59.00
11  2   P   2018-08-29  100 153.00
12  2   M   2018-10-29  100 61.00
13  2   F   2018-12-29  500 61.00
14  3   M   2017-03-20  300 nan
15  3   F   2018-06-20  700 457.00
16  3   P   2018-08-20  100 61.00
17  3   M   2018-10-20  250 61.00
18  3   F   2018-11-20  100 31.00
19  3   P   2018-12-20  100 30.00
20  3   F   2019-03-20  600 90.00
21  3   M   2019-05-20  200 61.00
22  4   M   2017-08-10  800 nan
23  4   F   2018-06-10  100 304.00
24  4   P   2018-08-10  120 61.00
25  4   F   2018-10-10  500 61.00
26  4   M   2019-01-10  200 92.00
27  4   F   2019-06-10  600 151.00

jottbe · Accepted Answer

Another approach. It doesn't require the last row to be a 'F', so no .iloc[-2] is required:

# the following two lines are only required if the date column is a string
# and the dataframe is not already sorted
df['Date']= df['Date'].astype('datetime64')
df.sort_values(['ID', 'Date'], axis='index', ascending=True, inplace=True)

# add some auxillary columns
df['aux_date']= df['Date'].where(df['Status'] != 'F', pd.NaT)
df= pd.concat([df, pd.get_dummies(df['Status'])], axis='columns')

# calculate the relevant dates and the sums of the statuses
grouped= df.groupby('ID')
aggregated= grouped.agg({'Date': 'min', 'aux_date': 'max', 'No_of_F': 'sum', 'No_of_M': 'sum', 'No_of_P': 'sum'})
aggregated['max_date']= grouped['Date'].apply(max)

# calculate the date differences
aggregated.loc[~aggregated['aux_date'].isnull(), 'NoDays_to_SLS']=    aggregated['aux_date'] - aggregated['Date']
aggregated.loc[~aggregated['aux_date'].isnull(), 'NoDays_SLS_to_LS']= aggregated['max_date'] - aggregated['aux_date']

# drop the columns that were just inserted for calculation
aggregated.drop(['Date', 'aux_date', 'max_date'], axis='columns', inplace=True)

The output is:

    No_of_F  No_of_M  No_of_P NoDays_to_SLS NoDays_SLS_to_LS
ID                                                          
1         3        2        2      487 days         151 days
2         3        3        1      487 days          61 days
3         3        2        2      640 days          90 days
4         3        2        1      518 days         151 days
5         2        2        2      487 days           0 days

Note: I inserted ID 5 to show what I mean by that I don't rely on the last record to be an F. ID 5 is just a copy of the rows of ID 1 with the last F-line remove. That's why NoDays_SLS_to_LS is zero for that line.

The NoDays...-columns are of type timedelta64. If you rather want to store them as ints, you can do:

aggregated['NoDays_to_SLS']=    aggregated['NoDays_to_SLS'].dt.days
aggregated['NoDays_SLS_to_LS']= aggregated['NoDays_SLS_to_LS'].dt.days

Grouby filter, count each category, duration and identify second last based on date

Answers (2)

Related Questions