Danish
Danish

Reputation: 2871

Groupby by sort based on date time, groupby sequence based on 'ID' and Date and then mean by sequence

I am new in pandas functionality. I have a DF as shown below. which is repair data of mobiles.

  ID    Status       Date     Cost
0   1     F       22-Jun-17   500
1   1     M       22-Jul-17   100
2   2     M       29-Jun-17   200
3   3     M       20-Mar-17   300
4   4     M       10-Aug-17   800
5   2     F       29-Sep-17   600
6   2     F       29-Jan-18   500
7   1     F       22-Jun-18   600
8   3     F       20-Jun-18   700
9   1     M       22-Aug-18   150
10  1     F       22-Mar-19   750
11  3     M       20-Oct-18   250
12  4     F       10-Jun-18   100

I tried to find out the duration for each id from previous status.

find the mean for each status sequence for that ID.

My expected output is shown below.

    ID  S1    S1_Dur  S2    S2_dur  S3   S3_dur  S4   S4_dur    Avg_MF  Avg_FM
0   1   F-M    30     M-F   335.00  F-M  61.00   M-F  750.00    542.50  45.50
1   2   M-F    92     F-F   122.00  NaN  nan     NaN    nan     92.00   nan
2   3   M-F    457    F-M   122.00  NaN  nan     NaN    nan     457.00  122.00
3   4   M-F    304    NaN   nan     NaN  nan     NaN    nan     304.00  nan


S1 = first sequence
S1_Dur = S1 Duration
Avg_MF = Average M-F Duration
Avg_FMn = Average F-M Duration

I tried following codes

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df = df.reset_index().sort_values(['ID', 'Date', 'Status']).set_index(['ID', 'Status'])
df['Difference'] = df.groupby('ID')['Date'].transform(pd.Series.diff)
df.reset_index(inplace=True)

Then I got a DF as shown below

   ID   Status  index   Date    Cost    Difference
0   1   F        0  2017-06-22  500     NaT
1   1   M        1  2017-07-22  100     30 days
2   1   F        7  2018-06-22  600     335 days
3   1   M        9  2018-08-22  150     61 days
4   1   F       10  2019-03-22  750     212 days
5   2   M        2  2017-06-29  200     NaT
6   2   F        5  2017-09-29  600     92 days
7   2   F        6  2018-01-29  500     122 days
8   3   M        3  2017-03-20  300     NaT
9   3   F        8  2018-06-20  700     457 days
10  3   M       11  2018-10-20  250     122 days
11  4   M        4  2017-08-10  800     NaT
12  4   F       12  2018-06-10  100     304 days

After that I am stuck.

Upvotes: 1

Views: 90

Answers (1)

jezrael
jezrael

Reputation: 863166

Idea is create new columns for difference by DataFrameGroupBy.diff and join shifted values of Status by DataFrameGroupBy.shift. Remove rows with missing values in S column. Then reshape by DataFrame.unstack with GroupBy.cumcount for counter column, create means per pairs of S by DataFrame.pivot_table and last use DataFrame.join:

df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df = df.sort_values(['ID', 'Date', 'Status'])

df['D'] = df.groupby('ID')['Date'].diff().dt.days
df['S'] = df.groupby('ID')['Status'].shift() + '-'+ df['Status']
df = df.dropna(subset=['S'])
df['g'] = df.groupby('ID').cumcount().add(1).astype(str)

df1 = df.pivot_table(index='ID', columns='S', values='D', aggfunc='mean').add_prefix('Avg_')

df2 = df.set_index(['ID', 'g'])[['S','D']].unstack().sort_index(axis=1, level=1)
df2.columns = df2.columns.map('_'.join)

df3 = df2.join(df1).reset_index()
print (df3)

   ID    D_1  S_1    D_2  S_2   D_3  S_3    D_4  S_4  Avg_F-F  Avg_F-M  \
0   1   30.0  F-M  335.0  M-F  61.0  F-M  212.0  M-F      NaN     45.5   
1   2   92.0  M-F  122.0  F-F   NaN  NaN    NaN  NaN    122.0      NaN   
2   3  457.0  M-F  122.0  F-M   NaN  NaN    NaN  NaN      NaN    122.0   
3   4  304.0  M-F    NaN  NaN   NaN  NaN    NaN  NaN      NaN      NaN   

   Avg_M-F  
0    273.5  
1     92.0  
2    457.0  
3    304.0 

Upvotes: 1

Related Questions