Reputation: 654
I have this dataframe :
id start end
1 1 2
1 13 27
1 30 35
1 36 40
2 2 5
2 8 10
2 25 30
I want to groupby over id and aggregate rows where difference of end
of n-1 row and start
of n row is less than 10 for example. I already find a way using a loop but it's far too long with over a million rows.
So the expected outcome would be :
id start end
1 1 2
1 13 40
2 2 10
2 25 30
First I can get the required difference by using df['diff']=df['start'].shift(-1)-df['end']
. How can I gather ids based on the condition for each different id ?
Thanks !
Upvotes: 0
Views: 68
Reputation: 862901
I believe you can create groups by suntract shifted end
by DataFrameGroupBy.shift
with greater like 10
and cumulative sum and pass to GroupBy.agg
:
g = df['start'].sub(df.groupby('id')['end'].shift()).gt(10).cumsum()
df = (df.groupby(['id',g])
.agg({'start':'first', 'end': 'last'})
.reset_index(level=1, drop=True)
.reset_index())
print (df)
id start end
0 1 1 2
1 1 13 40
2 2 2 10
3 2 25 30
Upvotes: 1