Pandas joining dataframes with different intervals in matching columns

Question

I'm a bit confused as to how to even formulate the question properly. I have two Pandas dataframes:

data = {'ID':['A1','A1','A2','A2','A2'], 'FROM':[0,2,0,2,4], 'TO':[2,4,2,4,6], 'PYR' : [0.25,0.11,0.05,0,0.5]}

df = pd.DataFrame(data, columns = ['ID', 'FROM', 'TO', 'PYR'])

So the df looks like so:

   ID  FROM  TO   PYR
0  A1     0   2  0.25
1  A1     2   4  0.11
2  A2     0   2  0.05
3  A2     2   4  0.00
4  A2     4   6  0.50

And the second:

new_data = {'ID':['A1','A2','A2'], 'FROM':[0, 0, 3.5], 'TO':[4, 3.5, 6], 'STRAT':['TD3', 'J1','J2']}

df2 = pd.DataFrame(new_data, columns = ['ID', 'FROM', 'TO', 'STRAT'])

   ID  FROM   TO STRAT
0  A1   0.0  4.0   TD3
1  A2   0.0  3.5    J1
2  A2   3.5  6.0    J2

What i want to do is add STRAT from the second dataframe to the first one. Both dataframes have same full ranges for every id, but individual intervals are obviously different.

What I want is to fill in STRAT in such a way that if it overlaps with more than 50% of the interval of the first dataframe, it will get assigned to that interval so the expected result looks like this:

   ID  FROM  TO   PYR STRAT
0  A1     0   2  0.25   TD3
1  A1     2   4  0.11   TD3
2  A2     0   2  0.05    J1
3  A2     2   4  0.00    J1
4  A2     4   6  0.50    J2

I'm not quite sure how to approach this question. I would appreciate if someone could point me in a right direction. Thanks!

Bernardo Alencar · Accepted Answer

What you can do is: merge the data frames then filter what is outside the expected interval. This would be:

# Merge
df = df.merge(df2, on='ID', suffixes=('_1', '_2'))

# Calculate interval overlap
amount_overlap = (df[['TO_1', 'TO_2']].min(axis=1) -
    df[['FROM_2', 'FROM_1']].max(axis=1))

# Filter rows where overlap under 50% of df TO-FROM interval
df = df[(amount_overlap)/(df.TO_1 - df.FROM_1) > 0.5]

If necessary, you can revert columns names:

df = df.rename(columns={'TO_1':'TO', 'FROM_1': 'FROM'})

And drop unnecessary columns:

df = df.drop(['TO_2', 'FROM_2'], axis=1)

Pandas joining dataframes with different intervals in matching columns

Answers (2)

Related Questions