Melsauce
Melsauce

Reputation: 2655

How to find overlapping rows between two dataframes based on start and end columns?

I have two pandas dataframes df1 and df2 of the form:

df1

start   end   text   source
1       5     abc     1  
8       10    def     1
15      20    ghi     1
25      30    xxx     1
42      45    zzz     1 

df2

start   end   text   source
1       6     jkl     2  
7       9     mno     2
11      13    pqr     2
16      17    stu     2
18      19    vwx     2
32      37    yyy     2
40      47    rrr     2

I want to return the intersections of the two dataframes based on the start and end columns in following format:

out_df

start_1   end_1   start_2   end_2  text_1   text_2
1         5       1         6      abc      jkl        
8         10      7         9      def      mno
15        20      16        17     ghi      stu 
15        20      18        19     ghi      vwx
42        45      40        47     zzz      rrr

What is the best method to achieve this?

Upvotes: 1

Views: 599

Answers (1)

sammywemmy
sammywemmy

Reputation: 28709

One option is with conditional_join from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

df1.conditional_join(
      df2, 
      ('start', 'end', '<='), 
      ('end', 'start', '>='))

   left                 right
  start end text source start end text source
0     1   5  abc      1     1   6  jkl      2
1     8  10  def      1     7   9  mno      2
2    15  20  ghi      1    16  17  stu      2
3    15  20  ghi      1    18  19  vwx      2
4    42  45  zzz      1    40  47  rrr      2

In the dev version, you can rename the columns, and avoid the MultiIndex (the MultiIndex occurs because the column names are not unique):

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

df1.conditional_join(
        df2, 
        ('start', 'end', '<='), 
        ('end', 'start', '>='), 
        df_columns = {'start':'start_1', 
                      'end':'end_1', 
                      'text':'text_1'},
        right_columns = {'start':'start_2', 
                         'end':'end_2', 
                         'text':'text_2'})

   start_1  end_1 text_1  start_2  end_2 text_2
0        1      5    abc        1      6    jkl
1        8     10    def        7      9    mno
2       15     20    ghi       16     17    stu
3       15     20    ghi       18     19    vwx
4       42     45    zzz       40     47    rrr

The idea for overlaps is the start of interval one should be less than the end of interval 2, while the end of interval two should be less than the start of interval one, that way overlap is assured. I pulled that idea from pd.Interval.overlaps here

Another option is with the piso library; the answer here might point you in the right direction

Upvotes: 3

Related Questions