Reputation: 2655
I have two pandas dataframes df1 and df2 of the form:
df1
start end text source
1 5 abc 1
8 10 def 1
15 20 ghi 1
25 30 xxx 1
42 45 zzz 1
df2
start end text source
1 6 jkl 2
7 9 mno 2
11 13 pqr 2
16 17 stu 2
18 19 vwx 2
32 37 yyy 2
40 47 rrr 2
I want to return the intersections of the two dataframes based on the start and end columns in following format:
out_df
start_1 end_1 start_2 end_2 text_1 text_2
1 5 1 6 abc jkl
8 10 7 9 def mno
15 20 16 17 ghi stu
15 20 18 19 ghi vwx
42 45 40 47 zzz rrr
What is the best method to achieve this?
Upvotes: 1
Views: 599
Reputation: 28709
One option is with conditional_join from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df1.conditional_join(
df2,
('start', 'end', '<='),
('end', 'start', '>='))
left right
start end text source start end text source
0 1 5 abc 1 1 6 jkl 2
1 8 10 def 1 7 9 mno 2
2 15 20 ghi 1 16 17 stu 2
3 15 20 ghi 1 18 19 vwx 2
4 42 45 zzz 1 40 47 rrr 2
In the dev version, you can rename the columns, and avoid the MultiIndex (the MultiIndex occurs because the column names are not unique):
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
df1.conditional_join(
df2,
('start', 'end', '<='),
('end', 'start', '>='),
df_columns = {'start':'start_1',
'end':'end_1',
'text':'text_1'},
right_columns = {'start':'start_2',
'end':'end_2',
'text':'text_2'})
start_1 end_1 text_1 start_2 end_2 text_2
0 1 5 abc 1 6 jkl
1 8 10 def 7 9 mno
2 15 20 ghi 16 17 stu
3 15 20 ghi 18 19 vwx
4 42 45 zzz 40 47 rrr
The idea for overlaps is the start of interval one should be less than the end of interval 2, while the end of interval two should be less than the start of interval one, that way overlap is assured. I pulled that idea from pd.Interval.overlaps
here
Another option is with the piso library; the answer here might point you in the right direction
Upvotes: 3