YYL
YYL

Reputation: 11

pandas combine rows with overlapping values

I'd like use pandas to group overlapping intervals within the same block number. This is the input df:

block   start   end
1   100 105
1   105 200
1   195 205
2   1   10
2   10  25
2   15  20
2   18  30
2   30  40

the expected output is

block   start   end
1   100 105
1   105 205
2   1   10
2   10  30
2   30  40

Upvotes: 1

Views: 154

Answers (2)

David Erickson
David Erickson

Reputation: 16683

first create a grp, which you then use in your groupby:

grp = (df.groupby('block').apply(lambda x: (x['start'].ge(x['end'].shift())
                                            .cumsum())).reset_index(drop=True))
df.groupby(['block', grp], as_index=False).agg({'start': 'min', 'end' : 'max'})
Out[1]: 
   block  start  end
0      1    100  105
1      1    105  205
2      2      1   10
3      2     10   30
4      2     30   40

For creation of grp:

  1. You know what you want to groupby -- block
  2. The hard part is how you can possibly do some sort of function after gouping by block
  3. You can achieve this by lambda x on the dataframe where x is the dataframe and you can use multiple columns. If you call just a column after the groupby syntax, then you cannot compare multiple columns in a function.
  4. The lambda x: basically looks at any given start row per group and compares to the previous end row with .ge (greater than or equal) and .shift(). That returns True or False row-by-row in a multi-index series, so you are ready to take the cumsum() to get the group numbers per block and .reset_index() so that the index is the same as the dataframe and can be used in a .groupby()
  5. So, use this series grp as another series to include in a groupby to transform your dataframe as required with agg

Upvotes: 4

Clay Shwery
Clay Shwery

Reputation: 380

This solution works.

The key is to identify which records need to be joined, (those where the diff in end time between it and the previous record is greater than the diff between it's own start and end time), then join them using groupby.

Repeat until all overlapping records have been merged.

df_txt = """
bl  st  end
1   100 105
1   105 200
1   195 205
2   1   10
2   10  25
2   15  20
2   18  30
2   30  40"""

df = pd.read_fwf(io.StringIO(df_txt)).rename(columns={'bl':'block','st':'start'})

df['next_end_diff'] = df.groupby('block')['end'].diff()
df['st_to_st_diff'] = df.groupby('block')['start'].diff()
df['st_to_end_diff'] = df['end']-df['start']
df_mask = df['next_end_diff']<df['st_to_end_diff']
df.loc[df_mask,'start'] = df.loc[df_mask,'start'] - df.loc[df_mask,'st_to_st_diff']
df = df.groupby(['block','start'])['end'].max().reset_index()
while sum(df_mask)>0:
    df['next_end_diff'] = df.groupby('block')['end'].diff()
    df['st_to_st_diff'] = df.groupby('block')['start'].diff()
    df['st_to_end_diff'] = df['end']-df['start']
    df_mask = df['next_end_diff']<df['st_to_end_diff']
    df.loc[df_mask,'start'] = df.loc[df_mask,'start'] - df.loc[df_mask,'st_to_st_diff']
    df = df.groupby(['block','start'])['end'].max().reset_index()

Upvotes: 0

Related Questions