John
John

Reputation: 835

Merge intervals in pandas dataframe based on another column

I have a pandas dataframe that looks like as follows:

    ID     start   end     name
0   1       119    200      ss
1   1       118    156      ss  
2.  1       110    200      ss
3   1       15     25       me   
4   4       30     40       gg
5   4       30     55       gg

What I want do is to merge the overlapping intervals that have the same name (name column), and whose coordinates (start,end) overlap. So the resulting the dataframe will look like:

    ID     start   end     name
0   1       110    200      ss
1   1       15     25       me   
2   4       30     55       gg

For example for ss in name column the lowest start value is 110 and the highest end value is 200. Therefore, the new dataframe will have 110 has start and 200 as end. How can I achieve this? Insights will be appreciated.

Upvotes: 2

Views: 473

Answers (2)

jezrael
jezrael

Reputation: 863291

If possible simplify ouput by aggregate min and max per name and ID column with same order of groups and same order of columns in ouput use:

df1 = (df.groupby(['name', 'ID'], sort=False, as_index=False)
         .agg({'start': min, 'end': max})
         .reindex(df.columns, axis=1))
print (df1)
   ID  start  end name
0   1    110  200   ss
1   1     15   25   me
2   4     30   55   gg

Upvotes: 2

Bertil Johannes Ipsen
Bertil Johannes Ipsen

Reputation: 1776

Starting with this

     ID  start  end name
0.0   1    119  200   ss
1.0   1    118  156   ss
2.0   1    110  200   ss
3.0   1     15   25   me
4.0   4     30   40   gg
5.0   4     30   55   gg

We can groupby and aggregate min and max. For the ones where we just want to keep the value that is the same, we can use min to grap the first one.

df.groupby('name').agg({'ID': min, 'start': min, 'end': max, 'name': min})

Result

      ID  start  end name
name                     
gg     4     30   55   gg
me     1     15   25   me
ss     1    110  200   ss

Upvotes: 3

Related Questions