Reputation: 1
I have a dataset and I need to perform some calculations on it. This is a sample of it:
id group_id start_date end_date
45891167 00000dc8-b87e 2017-09-14 2017-12-15
45891167 00000dc8-b87e 2017-12-15 2018-03-18
50692679 000012e9-e137 2018-03-29 2019-03-05
50692679 000012e9-e137 2019-03-05 2019-03-27
50692679 000012e9-e137 2019-03-27 2019-04-05
51622620 000011e5-e107 2021-03-15 NaT
81652621 000019e1-e111 2020-06-15 2020-06-19
81652621 000019e1-e111 2020-07-16 NaT
I want the transformed df to look like this (see below). The logic I wanted to apply is the following: if the difference between the end_date
of the prior value and the start_date
of the current value is higher than 20 days set start_date
as the row's start_date
. If not use start_date
of the prior row. This logic should be applied for each group_id
.
id group_id start_date end_date first_date
45891167 00000dc8-b87e 2017-09-14 2017-12-15 2017-09-14
45891167 00000dc8-b87e 2017-12-15 2018-03-18 2017-09-14
50692679 000012e9-e137 2018-03-29 2019-03-05 2018-03-29
50692679 000012e9-e137 2019-03-07 2019-03-27 2018-03-29
50692679 000012e9-e137 2019-03-27 2019-04-05 2018-03-29
51622620 000011e5-e107 2021-03-15 NaT 2021-03-15
81652621 000019e1-e111 2020-06-15 2020-06-19 2020-06-15
81652621 000019e1-e111 2021-07-16 NaT 2021-07-16
I have created this transformation logic to get the expected input. The issue is the original dataset has over 1M distinct group_id
and by using the for loop it seems this approach isn't going to be performant.
for group in df['group_id'].unique():
indexes = (df.loc[df['group_id'] == group]).index.tolist()
df['end_date_previous_row'] = (df.loc[df['group_id'] == group])['end_date'].shift(1)
if len((df.loc[df['group_id'] == group]).index) == 1:
df['first_date'] = (df.loc[df['group_id'] == group])['start_date']
else:
for idx, row in (df.loc[df['group_id'] == group]).iterrows():
if idx == indexes[0]:
df.loc[idx, 'first_date'] = df.loc[idx, 'start_date']
else:
difference = row['start_date'].to_pydatetime() - row['end_date_previous_row'].to_pydatetime()
df.loc[idx, 'difference'] = difference
if difference.days >= 20:
df.loc[idx, 'first_date'] = df.loc[idx, 'start_date']
else:
df['first_date'].fillna(method='ffill', limit=1, inplace=True)
Is there a more performant approach? Maybe using groupby
and transform
?
Upvotes: 0
Views: 81
Reputation: 323226
The idea here is use groupby
with transform
df['start_date'] = pd.to_datetime(df['start_date'])
g = df.groupby('id')
s = (df['start_date'] - g['end_date'].shift()).dt.days>=20
df['first_date '] = g['start_date'].transform('first').mask(s,df['start_date'])
Out[921]:
0 2017-09-14
1 2017-09-14
2 2018-03-29
3 2018-03-29
4 2018-03-29
5 2021-03-15
6 2020-06-15
7 2020-07-16
Name: start_date, dtype: datetime64[ns]
Upvotes: 1