NotAName
NotAName

Reputation: 4347

Pandas joining dataframes with different intervals in matching columns

I'm a bit confused as to how to even formulate the question properly. I have two Pandas dataframes:

data = {'ID':['A1','A1','A2','A2','A2'], 'FROM':[0,2,0,2,4], 'TO':[2,4,2,4,6], 'PYR' : [0.25,0.11,0.05,0,0.5]}

df = pd.DataFrame(data, columns = ['ID', 'FROM', 'TO', 'PYR'])

So the df looks like so:

   ID  FROM  TO   PYR
0  A1     0   2  0.25
1  A1     2   4  0.11
2  A2     0   2  0.05
3  A2     2   4  0.00
4  A2     4   6  0.50

And the second:

new_data = {'ID':['A1','A2','A2'], 'FROM':[0, 0, 3.5], 'TO':[4, 3.5, 6], 'STRAT':['TD3', 'J1','J2']}

df2 = pd.DataFrame(new_data, columns = ['ID', 'FROM', 'TO', 'STRAT'])

   ID  FROM   TO STRAT
0  A1   0.0  4.0   TD3
1  A2   0.0  3.5    J1
2  A2   3.5  6.0    J2

What i want to do is add STRAT from the second dataframe to the first one. Both dataframes have same full ranges for every id, but individual intervals are obviously different.

What I want is to fill in STRAT in such a way that if it overlaps with more than 50% of the interval of the first dataframe, it will get assigned to that interval so the expected result looks like this:

   ID  FROM  TO   PYR STRAT
0  A1     0   2  0.25   TD3
1  A1     2   4  0.11   TD3
2  A2     0   2  0.05    J1
3  A2     2   4  0.00    J1
4  A2     4   6  0.50    J2

I'm not quite sure how to approach this question. I would appreciate if someone could point me in a right direction. Thanks!

Upvotes: 1

Views: 209

Answers (2)

Bernardo Alencar
Bernardo Alencar

Reputation: 81

What you can do is: merge the data frames then filter what is outside the expected interval. This would be:

# Merge
df = df.merge(df2, on='ID', suffixes=('_1', '_2'))

# Calculate interval overlap
amount_overlap = (df[['TO_1', 'TO_2']].min(axis=1) -
    df[['FROM_2', 'FROM_1']].max(axis=1))

# Filter rows where overlap under 50% of df TO-FROM interval
df = df[(amount_overlap)/(df.TO_1 - df.FROM_1) > 0.5]

If necessary, you can revert columns names:

df = df.rename(columns={'TO_1':'TO', 'FROM_1': 'FROM'})

And drop unnecessary columns:

df = df.drop(['TO_2', 'FROM_2'], axis=1)

Upvotes: 2

Quang Hoang
Quang Hoang

Reputation: 150785

I would do a cross join on ID, then filter those valid (FROM-TO overlaps) and then groupby ID, FROM, TO and take the max overlap

new_df = (df.merge(df2, on='ID', suffixes=['','_tmp'])
           .query('(FROM_tmp <= FROM & TO <= TO_tmp) | \
                   (FROM <= FROM_tmp <= TO) | \
                   (FROM <= TO_tmp <= TO)'
                 )
)
s1 = (new_df['FROM_tmp'].le(new_df['FROM']) &
      new_df['TO'].le( new_df['TO_tmp'])
     )
s2 = (new_df['FROM_tmp'].ge(new_df['FROM']) &
      new_df['FROM_tmp'].le( new_df['TO'])
     )
new_df['overlap'] = np.select((s1,s2),
                              (new_df['TO_tmp'] - new_df['FROM_tmp'],
                               new_df['TO'] - new_df['FROM_tmp']),
                               new_df['TO_tmp'] - new_df['FROM']                            
                             )

# output
(new_df.loc[new_df.groupby(['ID','FROM', 'TO'])
               .overlap.idxmax()]
     .drop(['FROM_tmp', 'TO_tmp', 'overlap'], axis=1)
)

Output:

   ID  FROM  TO   PYR STRAT
0  A1     0   2  0.25   TD3
1  A1     2   4  0.11   TD3
2  A2     0   2  0.05    J1
4  A2     2   4  0.00    J1
7  A2     4   6  0.50    J2

Upvotes: 1

Related Questions