Reputation: 11
I'd like use pandas to group overlapping intervals within the same block number. This is the input df:
block start end
1 100 105
1 105 200
1 195 205
2 1 10
2 10 25
2 15 20
2 18 30
2 30 40
the expected output is
block start end
1 100 105
1 105 205
2 1 10
2 10 30
2 30 40
Upvotes: 1
Views: 154
Reputation: 16683
first create a grp
, which you then use in your groupby
:
grp = (df.groupby('block').apply(lambda x: (x['start'].ge(x['end'].shift())
.cumsum())).reset_index(drop=True))
df.groupby(['block', grp], as_index=False).agg({'start': 'min', 'end' : 'max'})
Out[1]:
block start end
0 1 100 105
1 1 105 205
2 2 1 10
3 2 10 30
4 2 30 40
For creation of grp
:
block
block
lambda x
on the dataframe
where x
is the dataframe and you can use multiple columns. If you call just a column after the groupby
syntax, then you cannot compare multiple columns in a function.lambda x:
basically looks at any given start
row per group and compares to the previous end
row with .ge
(greater than or equal) and .shift()
. That returns True
or False
row-by-row in a multi-index series, so you are ready to take the cumsum()
to get the group numbers per block and .reset_index()
so that the index is the same as the dataframe and can be used in a .groupby()
grp
as another series to include in a groupby
to transform your dataframe as required with agg
Upvotes: 4
Reputation: 380
This solution works.
The key is to identify which records need to be joined, (those where the diff in end time between it and the previous record is greater than the diff between it's own start and end time), then join them using groupby.
Repeat until all overlapping records have been merged.
df_txt = """
bl st end
1 100 105
1 105 200
1 195 205
2 1 10
2 10 25
2 15 20
2 18 30
2 30 40"""
df = pd.read_fwf(io.StringIO(df_txt)).rename(columns={'bl':'block','st':'start'})
df['next_end_diff'] = df.groupby('block')['end'].diff()
df['st_to_st_diff'] = df.groupby('block')['start'].diff()
df['st_to_end_diff'] = df['end']-df['start']
df_mask = df['next_end_diff']<df['st_to_end_diff']
df.loc[df_mask,'start'] = df.loc[df_mask,'start'] - df.loc[df_mask,'st_to_st_diff']
df = df.groupby(['block','start'])['end'].max().reset_index()
while sum(df_mask)>0:
df['next_end_diff'] = df.groupby('block')['end'].diff()
df['st_to_st_diff'] = df.groupby('block')['start'].diff()
df['st_to_end_diff'] = df['end']-df['start']
df_mask = df['next_end_diff']<df['st_to_end_diff']
df.loc[df_mask,'start'] = df.loc[df_mask,'start'] - df.loc[df_mask,'st_to_st_diff']
df = df.groupby(['block','start'])['end'].max().reset_index()
Upvotes: 0