Merge two dataframes based on interval overlap

Question

I have two dataframes, A and B. For example:

import pandas as pd
import numpy as np
In [37]:
A = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200]})

A[["Start","End"]]
Out[37]:
   Start  End
0     10   11
1     11   11
2     20   35
3     62   70
4    198  200
In [38]:
B = pd.DataFrame({'Start': [8, 5, 8, 60], 'End': [10, 90, 13, 75], 'Info': ['some_info0','some_info1','some_info2','some_info3']})

B[["Start","End","Info"]]
Out[38]:
   Start  End        Info
0      8   10  some_info0
1      5   90  some_info1
2      8   13  some_info2
3     60   75  some_info3

I would like to add an Info column to dataframe A, filled in whenever the interval (Start-End) of a row in A overlaps an interval in B. If an A interval overlaps more than one B interval, the info from the shortest B interval should be used.

I have been looking around for ways to handle this and found some similar questions, but most of their answers use iterrows(), which is not viable in my case because I am dealing with huge dataframes.

I would like something like:

A.merge(B,on="overlapping_interval", how="left")

And then drop duplicates, keeping the info that comes from the shortest interval.

The output should look like this:

In [39]:
C = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200], 'Info': ['some_info0','some_info2','some_info1','some_info3',np.nan]})

C[["Start","End","Info"]]
Out[39]:
   Start  End        Info
0     10   11  some_info0
1     11   11  some_info2
2     20   35  some_info1
3     62   70  some_info3
4    198  200         NaN

I found this question really interesting, as it suggests the possibility of solving this issue using the pandas Interval object. But after many attempts I have not managed to solve it.

Any ideas?

David Leon · Accepted Answer

I would suggest writing a function and then applying it to the rows of A.

First, compute the length (End - Start) of each interval in B, to be used for sorting:

B['delta'] = B.End - B.Start

Then write a function that returns the info for one row of A:

def get_info(x):
    # x lies entirely inside the B interval
    c0 = (x.Start >= B.Start) & (x.End <= B.End)
    # x starts at or before the B interval's start and reaches it
    c1 = (x.Start <= B.Start) & (x.End >= B.Start)
    # x starts at or before the B interval's end and reaches or passes it
    c2 = (x.Start <= B.End) & (x.End >= B.End)

    # keep the overlapping rows of B, shortest interval first
    _B = B[c0|c1|c2].sort_values('delta', ascending=True)

    # Info of the shortest overlapping interval, or None if nothing overlaps
    return None if len(_B) == 0 else _B.iloc[0].Info

Then you can apply this function to A:

A['info'] = A.apply(get_info, axis='columns')


print(A)
   Start  End        info
0     10   11  some_info0
1     11   11  some_info2
2     20   35  some_info1
3     62   70  some_info3
4    198  200        None
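
Since apply with axis='columns' still calls get_info once per row of A, it may be slow on very large frames. Below is a minimal sketch of a different technique, NumPy broadcasting, which compares every A interval with every B interval at once. The function name shortest_overlap_info is hypothetical and not part of pandas. It assumes closed intervals with Start <= End, and it builds a len(A) x len(B) boolean matrix, so very large inputs would probably need to be processed in chunks.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200]})
B = pd.DataFrame({'Start': [8, 5, 8, 60], 'End': [10, 90, 13, 75],
                  'Info': ['some_info0', 'some_info1', 'some_info2', 'some_info3']})


def shortest_overlap_info(A, B):
    """Hypothetical helper: Info of the shortest B interval overlapping each A row."""
    # Column vectors for A, row vectors for B, so comparisons broadcast to (len(A), len(B)).
    a_start = A['Start'].to_numpy()[:, None]
    a_end = A['End'].to_numpy()[:, None]
    b_start = B['Start'].to_numpy()[None, :]
    b_end = B['End'].to_numpy()[None, :]

    # Closed-interval overlap test (assumes Start <= End everywhere).
    overlaps = (a_start <= b_end) & (a_end >= b_start)

    # Lengths of the B intervals; non-overlapping pairs get +inf so argmin ignores them.
    lengths = (b_end - b_start).astype(float)
    masked = np.where(overlaps, lengths, np.inf)

    best = masked.argmin(axis=1)        # position in B of the shortest overlapping interval
    has_match = overlaps.any(axis=1)    # False for A rows that overlap nothing

    info = B['Info'].to_numpy(dtype=object)[best]
    return pd.Series(np.where(has_match, info, None), index=A.index, name='Info')


result = shortest_overlap_info(A, B)
```

When two overlapping B intervals have the same length, this sketch keeps the one that appears first in B.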

Note:

Instead of using pd.Interval, write your own conditions. c0, c1 and c2 are your overlap definitions; change them to get exactly the behaviour you expect.
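
To illustrate how changing the conditions changes the result, here is a small sketch. Assuming every interval has Start <= End, the three conditions c0, c1 and c2 together are equivalent to a single closed-interval test, and replacing <= and >= with strict inequalities stops intervals that only touch at an endpoint from counting as overlapping. The names overlap_mask and get_info_for, and the inclusive parameter, are hypothetical and introduced only for this example.

```python
import pandas as pd

B = pd.DataFrame({'Start': [8, 5, 8, 60], 'End': [10, 90, 13, 75],
                  'Info': ['some_info0', 'some_info1', 'some_info2', 'some_info3']})


def overlap_mask(start, end, B, inclusive=True):
    """Boolean mask over the rows of B whose interval overlaps (start, end)."""
    if inclusive:
        # Closed intervals: touching endpoints count as an overlap.
        return (start <= B['End']) & (end >= B['Start'])
    # Open intervals: touching endpoints do not count.
    return (start < B['End']) & (end > B['Start'])


def get_info_for(start, end, B, inclusive=True):
    """Info of the shortest overlapping B interval, or None if there is none."""
    hits = B[overlap_mask(start, end, B, inclusive)]
    if hits.empty:
        return None
    lengths = hits['End'] - hits['Start']
    return hits.loc[lengths.idxmin(), 'Info']


# The A interval (10, 11) touches the B interval (8, 10) only at 10.
closed_result = get_info_for(10, 11, B, inclusive=True)
open_result = get_info_for(10, 11, B, inclusive=False)
```

With the closed test, closed_result is 'some_info0', matching the expected output in the question; with the strict test, the interval (8, 10) no longer qualifies and open_result is 'some_info2'.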
