Reputation: 835
I have a pandas dataframe that looks like as follows:
ID start end name
0 1 119 200 ss
1 1 118 156 ss
2. 1 110 200 ss
3 1 15 25 me
4 4 30 40 gg
5 4 30 55 gg
What I want do is to merge the overlapping intervals that have the same name (name column), and whose coordinates (start,end) overlap. So the resulting the dataframe will look like:
ID start end name
0 1 110 200 ss
1 1 15 25 me
2 4 30 55 gg
For example for ss in name column the lowest start value is 110 and the highest end value is 200. Therefore, the new dataframe will have 110 has start and 200 as end. How can I achieve this? Insights will be appreciated.
Upvotes: 2
Views: 473
Reputation: 863291
If possible simplify ouput by aggregate min
and max
per name
and ID
column with same order of groups and same order of columns in ouput use:
df1 = (df.groupby(['name', 'ID'], sort=False, as_index=False)
.agg({'start': min, 'end': max})
.reindex(df.columns, axis=1))
print (df1)
ID start end name
0 1 110 200 ss
1 1 15 25 me
2 4 30 55 gg
Upvotes: 2
Reputation: 1776
Starting with this
ID start end name
0.0 1 119 200 ss
1.0 1 118 156 ss
2.0 1 110 200 ss
3.0 1 15 25 me
4.0 4 30 40 gg
5.0 4 30 55 gg
We can groupby and aggregate min and max. For the ones where we just want to keep the value that is the same, we can use min to grap the first one.
df.groupby('name').agg({'ID': min, 'start': min, 'end': max, 'name': min})
Result
ID start end name
name
gg 4 30 55 gg
me 1 15 25 me
ss 1 110 200 ss
Upvotes: 3