How to merge multiple rows based on two columns in pandas

Question

I have a data frame like this:

    Name1  Name2   Start End
    aaa    bbb     1     2
    aaa    bbb     2     22
    aaa    bbb     30    42
    ccc    ddd     100   141
    ccc    ddd     145   160
    ccc    ddd     160   178

How do I merge rows where the End of one row equals the Start of the next row, and otherwise keep the row as is? The expected result looks like this:

    Name1  Name2   Start End
    aaa    bbb     1     22
    aaa    bbb     30    42
    ccc    ddd     100   141
    ccc    ddd     145   178

I can do this with iterrows, but I am wondering if there is a better way, such as apply or groupby.

akuiper · Accepted Answer

To rephrase the problem: you need to find the intervals that don't overlap, and merge the ones that touch or overlap. Sort the Start column in ascending order; then, within each (Name1, Name2) pair, a new interval begins whenever the cumulative maximum of End over the previous rows is smaller than the current Start. Based on this observation, you can create a group variable and aggregate the minimum Start and maximum End of each merged interval:

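# Sort by Start so that cummax/shift compare each row with the rows before it.
df = df.sort_values('Start')
# Within each (Name1, Name2) pair, start a new group whenever Start exceeds
# the running maximum End of the previous rows, then collapse each group.
df.groupby(['Name1', 'Name2']).apply(
    lambda g: g.groupby((g.End.cummax().shift() < g.Start).cumsum())
               .agg({'Start': 'min', 'End': 'max'})
).reset_index(level=[0, 1])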

  Name1 Name2  Start  End
0   aaa   bbb      1   22
1   aaa   bbb     30   42
0   ccc   ddd    100  141
1   ccc   ddd    145  178
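
For reference, the same idea can be written without a nested apply, which makes the intermediate steps easier to inspect. This is only a sketch under assumptions: the sample frame is rebuilt from the question, and the helper columns running_end, prev_end and block are names made up for illustration, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name1": ["aaa", "aaa", "aaa", "ccc", "ccc", "ccc"],
    "Name2": ["bbb", "bbb", "bbb", "ddd", "ddd", "ddd"],
    "Start": [1, 2, 30, 100, 145, 160],
    "End":   [2, 22, 42, 141, 160, 178],
})

keys = ["Name1", "Name2"]
df = df.sort_values(keys + ["Start"]).reset_index(drop=True)

# Running maximum of End within each (Name1, Name2) pair.
df["running_end"] = df.groupby(keys)["End"].cummax()
# Shift it by one row per pair, so each row sees the max End of the rows before it
# (NaN for the first row of a pair).
df["prev_end"] = df.groupby(keys)["running_end"].shift()
# A row opens a new block when its Start is strictly greater than that value;
# comparisons against NaN are False, so the first row stays in block 0.
df["block"] = (df["prev_end"] < df["Start"]).astype(int)
df["block"] = df.groupby(keys)["block"].cumsum()

# Collapse each block to its smallest Start and largest End.
merged = (
    df.groupby(keys + ["block"], as_index=False)
      .agg(Start=("Start", "min"), End=("End", "max"))
      .drop(columns="block")
)
print(merged)
```

Because the comparison is strict (<), rows that merely touch (End equal to the next Start) are merged, as the question asks. Sorting by the key columns as well as Start is a choice made here so the output comes out grouped by pair.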
