mcortez4790

Reputation: 1

Pandas Transform on each group dataframe (performance)

I have a dataset and I need to perform some calculations on it. This is a sample of it:

id         group_id        start_date   end_date
45891167   00000dc8-b87e   2017-09-14   2017-12-15
45891167   00000dc8-b87e   2017-12-15   2018-03-18
50692679   000012e9-e137   2018-03-29   2019-03-05
50692679   000012e9-e137   2019-03-05   2019-03-27
50692679   000012e9-e137   2019-03-27   2019-04-05
51622620   000011e5-e107   2021-03-15   NaT
81652621   000019e1-e111   2020-06-15   2020-06-19
81652621   000019e1-e111   2020-07-16   NaT
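For anyone who wants to reproduce this, here is a minimal sketch that builds the sample above as a DataFrame (the variable name df is an assumption; it is what the code below operates on):

import pandas as pd

df = pd.DataFrame({
    'id': [45891167, 45891167, 50692679, 50692679,
           50692679, 51622620, 81652621, 81652621],
    'group_id': ['00000dc8-b87e', '00000dc8-b87e', '000012e9-e137', '000012e9-e137',
                 '000012e9-e137', '000011e5-e107', '000019e1-e111', '000019e1-e111'],
    'start_date': pd.to_datetime(['2017-09-14', '2017-12-15', '2018-03-29', '2019-03-05',
                                  '2019-03-27', '2021-03-15', '2020-06-15', '2020-07-16']),
    'end_date': pd.to_datetime(['2017-12-15', '2018-03-18', '2019-03-05', '2019-03-27',
                                '2019-04-05', None, '2020-06-19', None]),
})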

I want the transformed df to look like this (see below). The logic I want to apply, per group_id, is: if the gap between the previous row's end_date and the current row's start_date is 20 days or more, set first_date to the current row's start_date; otherwise carry the previous row's first_date forward. The small check sketched after the table illustrates the two gaps that occur in the sample.

id         group_id        start_date   end_date    first_date    
45891167   00000dc8-b87e   2017-09-14   2017-12-15  2017-09-14
45891167   00000dc8-b87e   2017-12-15   2018-03-18  2017-09-14
50692679   000012e9-e137   2018-03-29   2019-03-05  2018-03-29 
50692679   000012e9-e137   2019-03-05   2019-03-27  2018-03-29
50692679   000012e9-e137   2019-03-27   2019-04-05  2018-03-29
51622620   000011e5-e107   2021-03-15   NaT         2021-03-15
81652621   000019e1-e111   2020-06-15   2020-06-19  2020-06-15
81652621   000019e1-e111   2020-07-16   NaT         2020-07-16
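As a quick sanity check of the rule, here is a small sketch (my own illustration, not part of the original data) of the two gaps that occur in the sample:

import pandas as pd

# group 000012e9-e137, second row: starts the same day the previous row ends
print((pd.Timestamp('2019-03-05') - pd.Timestamp('2019-03-05')).days)  # 0  -> carry 2018-03-29 forward

# group 000019e1-e111, second row: starts 27 days after the previous end_date
print((pd.Timestamp('2020-07-16') - pd.Timestamp('2020-06-19')).days)  # 27 -> new first_date 2020-07-16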

I have created this transformation logic to get the expected output. The issue is that the original dataset has over 1M distinct group_id values, and with the for loop this approach does not look like it will be performant.

for group in df['group_id'].unique():
    group_rows = df.loc[df['group_id'] == group]
    indexes = group_rows.index.tolist()

    # previous row's end_date within the current group
    df.loc[indexes, 'end_date_previous_row'] = group_rows['end_date'].shift(1)

    if len(indexes) == 1:
        # single-row group: first_date is simply its own start_date
        df.loc[indexes, 'first_date'] = group_rows['start_date']
    else:
        for idx, row in df.loc[indexes].iterrows():
            if idx == indexes[0]:
                # the first row of a group starts a new period
                df.loc[idx, 'first_date'] = df.loc[idx, 'start_date']
            else:
                difference = row['start_date'] - row['end_date_previous_row']
                df.loc[idx, 'difference'] = difference
                if difference.days >= 20:
                    # a gap of 20+ days starts a new period
                    df.loc[idx, 'first_date'] = df.loc[idx, 'start_date']
                else:
                    # otherwise carry the previous row's first_date forward
                    df.loc[idx, 'first_date'] = df.loc[indexes[indexes.index(idx) - 1], 'first_date']

Is there a more performant approach? Maybe using groupby and transform?

Upvotes: 0

Views: 81

Answers (1)

BENY

Reputation: 323226

The idea here is to use groupby with transform:

df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])   # no-op if already datetime

g = df.groupby('group_id')
# True where the gap to the previous row's end_date is at least 20 days
s = (df['start_date'] - g['end_date'].shift()).dt.days >= 20
# default to the group's first start_date, overridden where a new period starts
df['first_date'] = g['start_date'].transform('first').mask(s, df['start_date'])

df['first_date']
Out[921]: 
0   2017-09-14
1   2017-09-14
2   2018-03-29
3   2018-03-29
4   2018-03-29
5   2021-03-15
6   2020-06-15
7   2020-07-16
Name: first_date, dtype: datetime64[ns]
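Note that transform('first') always takes the group's first start_date, which is all the sample needs. If a group could contain further rows after a 20-day gap, so that the new period's start has to be carried forward, one way to generalize the same groupby/transform idea is to label sub-periods with a cumulative sum of the gap flag. A sketch of that variant, assuming the same df and the flag s from above:

# every True in s opens a new sub-period inside its group
period = s.groupby(df['group_id']).cumsum()
# first start_date per (group_id, sub-period) is the carried-forward first_date
df['first_date'] = df.groupby(['group_id', period])['start_date'].transform('first')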

Upvotes: 1
