Reputation: 31
I have a data frame like this:
Name1 Name2 Start End
aaa bbb 1 2
aaa bbb 2 22
aaa bbb 30 42
ccc ddd 100 141
ccc ddd 145 160
ccc ddd 160 178
How do I merge rows where the End of one row is equal to the Start of the next row, and otherwise keep the row as is? The expected result looks like this:
Name1 Name2 Start End
aaa bbb 1 22
aaa bbb 30 42
ccc ddd 100 141
ccc ddd 145 178
I can do this using iterrows, but I am wondering if there is a better way, such as apply or groupby.
Upvotes: 2
Views: 721
Reputation: 215117
To rephrase the problem, you need to find the intervals that don't overlap: if we sort the Start column in ascending order, then whenever the cumulative maximum of End is smaller than the next Start, you have a new interval (an End equal to the next Start still counts as the same interval). Based on this observation, you can create a new group variable and aggregate the new Start and End for the merged intervals:
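# Sort by Start so the running maximum of End sees the intervals in order
df.sort_values('Start', inplace=True)
df.groupby(['Name1', 'Name2']).apply(
    # A new group begins wherever the previous running max of End is below the current Start
    lambda g: g.groupby((g.End.cummax().shift() < g.Start).cumsum()).agg({'Start': 'min', 'End': 'max'})
).reset_index(level=[0,1])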
Name1 Name2 Start End
0 aaa bbb 1 22
1 aaa bbb 30 42
0 ccc ddd 100 141
1 ccc ddd 145 178
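For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same cumulative-maximum idea, written without the nested apply. It rebuilds the sample frame from the question. The names keys, prev_end, new_interval, interval_id and merged are my own and purely illustrative, so treat this as one possible variant rather than the definitive way to do it.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Name1': ['aaa', 'aaa', 'aaa', 'ccc', 'ccc', 'ccc'],
    'Name2': ['bbb', 'bbb', 'bbb', 'ddd', 'ddd', 'ddd'],
    'Start': [1, 2, 30, 100, 145, 160],
    'End':   [2, 22, 42, 141, 160, 178],
})

keys = ['Name1', 'Name2']
df = df.sort_values(keys + ['Start'])

# Running maximum of End within each name pair, shifted by one row so that
# every row sees the largest End among the rows before it (NaN for the first).
prev_end = df.groupby(keys)['End'].transform(lambda s: s.cummax().shift())

# A row opens a new interval if it is the first of its pair, or if there is
# a gap between the previous running max of End and this row's Start.
new_interval = prev_end.isna() | (prev_end < df['Start'])

# Hypothetical helper column: a counter that increases at every new interval
df['interval_id'] = new_interval.cumsum()

merged = (
    df.groupby(keys + ['interval_id'], as_index=False)
      .agg(Start=('Start', 'min'), End=('End', 'max'))
      .drop(columns='interval_id')
)
print(merged)
```

On the sample data this should give the same four merged rows as above, only with a fresh 0..3 index instead of the per-group 0, 1, 0, 1. Computing the group key once for the whole frame avoids calling a Python lambda that runs a second groupby for every name pair, which may matter when there are many pairs.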
Upvotes: 2