Reputation: 47
I have a DataFrame like the one below (a slowly changing dimension). For each group identified by the Key column, I want to fill the end_dt column with the next start_dt in chronological order within that group.
Key start_dt end_dt
1 25-05-2019 NA
1 30-06-2019 NA
1 15-07-2019 NA
2 17-07-2019 NA
2 15-07-2019 NA
The end_dt column should be populated as follows:
Key start_dt end_dt
1 25-05-2019 30-06-2019
1 30-06-2019 15-07-2019
1 15-07-2019 NA
2 17-07-2019 NA
2 15-07-2019 17-07-2019
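For reproducibility, here is a minimal sketch that builds the input shown above. It assumes the dates are stored as DD-MM-YYYY strings and that end_dt starts out empty; the column types in the real data may differ.

```python
import numpy as np
import pandas as pd

# Assumption: start_dt holds DD-MM-YYYY strings and end_dt is entirely missing.
df = pd.DataFrame({
    'Key': [1, 1, 1, 2, 2],
    'start_dt': ['25-05-2019', '30-06-2019', '15-07-2019',
                 '17-07-2019', '15-07-2019'],
    'end_dt': [np.nan] * 5,
})
print(df)
```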
Here is what I tried. After sorting by start_dt ascending, I generated a sequence number within each Key group, starting at 2. I then created a second DataFrame with the same Key column and start_dt (renamed to end_dt), sorted it by start_dt, and generated a sequence number within each Key group starting at 1.
Finally, I joined the two DataFrames on Key and the sequence column. Is there a more efficient way to do this?
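import pandas as pd

# Number the rows of each Key group from 2 upward, in start_dt order
df.sort_values(['start_dt'], ascending=[True], inplace=True)
df['seq'] = df.groupby(['Key']).cumcount() + 2
# Second frame: the same start_dt values, renamed to end_dt and numbered from 1
temp = df[['Key', 'start_dt']].copy()
temp.sort_values(['start_dt'], ascending=[True], inplace=True)
temp = temp.rename(columns={'start_dt': 'end_dt'})
temp['seq'] = temp.groupby(['Key']).cumcount() + 1
# Join on (Key, seq) so each row picks up the following row's start_dt as end_dt
df = pd.merge(df.drop(columns='end_dt', errors='ignore'), temp[['Key', 'seq', 'end_dt']], how='left', on=['Key', 'seq'])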
Upvotes: 1
Views: 551
Reputation: 88226
It seems like you first need to sort_values by both Key and start_dt, and then you can use GroupBy and shift to assign the following start_dt:
df = df.sort_values(['Key', 'start_dt'])
df['end_dt'] = df.groupby('Key').start_dt.shift(-1)
print(df.reset_index(drop=True))
   Key    start_dt      end_dt
0    1  15-07-2019  25-05-2019
1    1  25-05-2019  30-06-2019
2    1  30-06-2019         NaN
3    2  15-07-2019  17-07-2019
4    2  17-07-2019         NaN
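One caveat: in the output above, Key 1 lists 15-07-2019 before 25-05-2019, which suggests start_dt was compared as text rather than as dates. A minimal sketch of a variant that parses the dates first is below. It assumes start_dt holds DD-MM-YYYY strings, and the helper name fill_end_dt is hypothetical, introduced only for illustration.

```python
import pandas as pd

def fill_end_dt(df: pd.DataFrame) -> pd.DataFrame:
    """Set end_dt to the next chronological start_dt within each Key."""
    out = df.copy()
    # Assumption: start_dt holds DD-MM-YYYY strings; parse them so the
    # sort below is chronological instead of lexicographic.
    out['start_dt'] = pd.to_datetime(out['start_dt'], format='%d-%m-%Y')
    # Sort only to compute the shift; the shifted Series keeps the original
    # index labels, so the assignment aligns back to the unsorted rows.
    ordered = out.sort_values(['Key', 'start_dt'])
    out['end_dt'] = ordered.groupby('Key')['start_dt'].shift(-1)
    return out

df = pd.DataFrame({
    'Key': [1, 1, 1, 2, 2],
    'start_dt': ['25-05-2019', '30-06-2019', '15-07-2019',
                 '17-07-2019', '15-07-2019'],
})
result = fill_end_dt(df)
print(result)
```

Because the shifted values align on the index, the rows keep their original order and the last start_dt of each Key gets NaT. Chain .sort_values(['Key', 'start_dt']) onto the result if sorted output is preferred.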
Upvotes: 3