Reputation: 23
I need help comparing dates across different rows and columns and making sure they follow chronological order.
First, I group the data by the id and group columns. Within each group, each date is supposed to come after the one before it.
The first group [1111 + A] contains an error because the dates don't follow chronological order:
1/1/2016 > 2/20/2016 > **2/19/2016** > 4/25/2016 > **4/1/2016** > 5/1/2016
Current result
id start end group
0 1111 01/01/2016 02/20/2016 A
1 1111 02/19/2016 04/25/2016 A
2 1111 04/01/2016 05/01/2016 A
3 2345 05/01/2016 05/28/2016 B
4 2345 05/29/2016 06/28/2016 B
5 1234 08/01/2016 09/16/2016 F
6 9882 01/01/2016 08/29/2016 D
7 9992 03/01/2016 03/15/2016 C
8 9992 03/16/2016 08/03/2016 C
9 9992 05/16/2016 09/16/2016 C
10 9992 09/17/2016 10/16/2016 C
11 9992 10/17/2016 12/13/2016 C
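For reference, here is a minimal snippet that rebuilds this sample frame (a sketch; column names and values are taken from the table above):

import pandas as pd

df = pd.DataFrame({
    'id': [1111, 1111, 1111, 2345, 2345, 1234, 9882, 9992, 9992, 9992, 9992, 9992],
    'start': ['01/01/2016', '02/19/2016', '04/01/2016', '05/01/2016', '05/29/2016',
              '08/01/2016', '01/01/2016', '03/01/2016', '03/16/2016', '05/16/2016',
              '09/17/2016', '10/17/2016'],
    'end': ['02/20/2016', '04/25/2016', '05/01/2016', '05/28/2016', '06/28/2016',
            '09/16/2016', '08/29/2016', '03/15/2016', '08/03/2016', '09/16/2016',
            '10/16/2016', '12/13/2016'],
    'group': ['A', 'A', 'A', 'B', 'B', 'F', 'D', 'C', 'C', 'C', 'C', 'C']})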
The answer should be:
1/1/2016 > 2/20/2016 > **2/21/2016** > 4/25/2016 > **4/26/2016** > 5/1/2016
Desired output
id start end group
0 1111 01/01/2016 02/20/2016 A
1 1111 02/21/2016 04/25/2016 A
2 1111 04/26/2016 05/01/2016 A
3 2345 05/01/2016 05/28/2016 B
4 2345 05/29/2016 06/28/2016 B
5 1234 08/01/2016 09/16/2016 F
6 9882 01/01/2016 08/29/2016 D
7 9992 03/01/2016 03/15/2016 C
8 9992 03/16/2016 08/03/2016 C
9 9992 08/04/2016 09/16/2016 C
10 9992 09/17/2016 10/16/2016 C
11 9992 10/17/2016 12/13/2016 C
Any help will be greatly appreciated.
Upvotes: 2
Views: 629
Reputation: 164823
One way is to apply your logic to each group, then concatenate your groups.
import pandas as pd

# convert series to datetime
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

# iterate groups and add the fixed frames to the grps list
grps = []
for _, group in df.groupby(['id', 'group'], sort=False):
    # previous row's end within the group (NaT for the first row)
    end_shift = group['end'].shift()
    # where a start does not come after the previous end, push it one day past it
    group.loc[group['start'] <= end_shift, 'start'] = end_shift + pd.DateOffset(1)
    grps.append(group)

# concatenate dataframes in grps to build a single dataframe
res = pd.concat(grps, ignore_index=True)
print(res)
id start end group
0 1111 2016-01-01 2016-02-20 A
1 1111 2016-02-21 2016-04-25 A
2 1111 2016-04-26 2016-05-01 A
3 2345 2016-05-01 2016-05-28 B
4 2345 2016-05-29 2016-06-28 B
5 1234 2016-08-01 2016-09-16 F
6 9882 2016-01-01 2016-08-29 D
7 9992 2016-03-01 2016-03-15 C
8 9992 2016-03-16 2016-08-03 C
9 9992 2016-08-04 2016-09-16 C
10 9992 2016-09-17 2016-10-16 C
11 9992 2016-10-17 2016-12-13 C
Upvotes: 2
Reputation: 51425
I believe this should work:
import pandas as pd

# First make sure your columns are datetimes:
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

# Get your new start times: the day after the previous row's end, per group
new_times = (df.groupby(['id', 'group'])
               .apply(lambda x: (x.end + pd.Timedelta(days=1)).shift())
               .reset_index(['id', 'group'], drop=True))

# Put the non-null values back into the original dataframe
df.loc[new_times.notnull(), 'start'] = new_times[new_times.notnull()]
>>> df
id start end group
0 1111 2016-01-01 2016-02-20 A
1 1111 2016-02-21 2016-04-25 A
2 1111 2016-04-26 2016-05-01 A
3 2345 2016-05-01 2016-05-28 B
4 2345 2016-05-29 2016-06-28 B
5 1234 2016-08-01 2016-09-16 F
6 9882 2016-01-01 2016-08-29 D
7 9992 2016-03-01 2016-03-15 C
8 9992 2016-03-16 2016-08-03 C
9 9992 2016-08-04 2016-09-16 C
10 9992 2016-09-17 2016-10-16 C
11 9992 2016-10-17 2016-12-13 C
Explanation: new_times looks like this:
>>> new_times
0 NaT
1 2016-02-21
2 2016-04-26
5 NaT
3 NaT
4 2016-05-29
6 NaT
7 NaT
8 2016-03-16
9 2016-08-04
10 2016-09-17
11 2016-10-17
You can then use df.loc[new_times.notnull(), 'start'] = new_times[new_times.notnull()] to find where new_times is not null (i.e. where it is not the first row in a given group) and insert those values into your original start column.
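An equivalent, slightly shorter way to write that last assignment (a sketch relying on pandas index alignment) is:

# fill the first-row NaTs with the original starts, keep the shifted dates elsewhere
df['start'] = new_times.fillna(df['start'])

Note that either form sets every non-first start to the day after the previous end, so it closes gaps between rows as well as overlaps; if you only want to touch overlapping rows, keep a mask like start <= previous end, as in the other answer.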
Upvotes: 1