Reputation: 3086
I have two pandas dataframes and need to match rows if two columns (with start
and end
co-ordinates) are overlapping without crossing boundaries.
For example:
df_1 = pd.DataFrame(data={'start': [0, 10, 23, 35], 'end': [5, 17, 28, 41], 'some_data_1': ['AA', 'BB', 'CC', 'DD']})
df_2 = pd.DataFrame(data={'start': [0, 12, 23, 55], 'end': [5, 17, 25, 62], 'some_data_2': ['AA_AA', 'BB_BB', 'CC_CC', 'DD_DD']})
Where
df_1 :
start end some_data_1
0 5 AA
10 17 BB
23 28 CC
35 41 DD
and
df_2 :
start end some_data_2
0 5 AA_AA
12 17 BB_BB
23 25 CC_CC
55 62 DD_DD
and the desired output is:
df_1_2 :
start_1 end_1 start_2 end_2 some_data_1 some_data_2
0 5 0 5 AA AA_AA
10 17 12 17 BB BB_BB
23 28 23 25 CC CC_CC
35 41 NaN NaN DD NaN
NaN NaN 55 62 NaN DD_DD
Is there an elegant way to check whether one segment (given by end
-start
) overlaps with another one and if they do, merge data frame on this condition.
Thanks!
Upvotes: 0
Views: 356
Reputation: 28709
Create a conditional to find if there is an overlap between the two frames, create new columns based on the conditionals, and merge, using how='outer'
What I observed from the data is that if the overlap (end-start) in df_1 is greater than or equal to the overlap in df_2, then add start_data_2, otherwise, leave as is. The calculation hinges on that; if it is a false premise OP, do let me know.
#create overlap columns
df_1['overlap']= df_1.end - df_1.start
df_2['overlap']= df_2.end - df_2.start
cond1 = df_1.overlap.ge(df_2.overlap)
df_1['key'] = np.where(cond1, df_2.some_data_2,'n1')
df_2['key'] = np.where(cond1, df_2.some_data_2,'n')
(pd
.merge(df_1,df_2,
how='outer',
on='key',
suffixes = ('_1','_2'))
.drop(['key','overlap_1','overlap_2'],
axis=1)
)
start_1 end_1 some_data_1 start_2 end_2 some_data_2
0 0.0 5.0 AA 0.0 5.0 AA_AA
1 10.0 17.0 BB 12.0 17.0 BB_BB
2 23.0 28.0 CC 23.0 25.0 CC_CC
3 35.0 41.0 DD NaN NaN NaN
4 NaN NaN NaN 55.0 62.0 DD_DD
Upvotes: 2