Reputation: 4347
I'm a bit confused as to how to even formulate the question properly. I have two Pandas dataframes:
data = {'ID':['A1','A1','A2','A2','A2'], 'FROM':[0,2,0,2,4], 'TO':[2,4,2,4,6], 'PYR' : [0.25,0.11,0.05,0,0.5]}
df = pd.DataFrame(data, columns = ['ID', 'FROM', 'TO', 'PYR'])
So the df looks like so:
ID FROM TO PYR
0 A1 0 2 0.25
1 A1 2 4 0.11
2 A2 0 2 0.05
3 A2 2 4 0.00
4 A2 4 6 0.50
And the second:
new_data = {'ID':['A1','A2','A2'], 'FROM':[0, 0, 3.5], 'TO':[4, 3.5, 6], 'STRAT':['TD3', 'J1','J2']}
df2 = pd.DataFrame(new_data, columns = ['ID', 'FROM', 'TO', 'STRAT'])
ID FROM TO STRAT
0 A1 0.0 4.0 TD3
1 A2 0.0 3.5 J1
2 A2 3.5 6.0 J2
What i want to do is add STRAT from the second dataframe to the first one. Both dataframes have same full ranges for every id, but individual intervals are obviously different.
What I want is to fill in STRAT in such a way that if it overlaps with more than 50% of the interval of the first dataframe, it will get assigned to that interval so the expected result looks like this:
ID FROM TO PYR STRAT
0 A1 0 2 0.25 TD3
1 A1 2 4 0.11 TD3
2 A2 0 2 0.05 J1
3 A2 2 4 0.00 J1
4 A2 4 6 0.50 J2
I'm not quite sure how to approach this question. I would appreciate if someone could point me in a right direction. Thanks!
Upvotes: 1
Views: 209
Reputation: 81
What you can do is: merge the data frames then filter what is outside the expected interval. This would be:
# Merge
df = df.merge(df2, on='ID', suffixes=('_1', '_2'))
# Calculate interval overlap
amount_overlap = (df[['TO_1', 'TO_2']].min(axis=1) -
df[['FROM_2', 'FROM_1']].max(axis=1))
# Filter rows where overlap under 50% of df TO-FROM interval
df = df[(amount_overlap)/(df.TO_1 - df.FROM_1) > 0.5]
If necessary, you can revert columns names:
df = df.rename(columns={'TO_1':'TO', 'FROM_1': 'FROM'})
And drop unnecessary columns:
df = df.drop(['TO_2', 'FROM_2'], axis=1)
Upvotes: 2
Reputation: 150785
I would do a cross join on ID
, then filter those valid (FROM-TO overlaps) and then groupby ID, FROM, TO
and take the max overlap
new_df = (df.merge(df2, on='ID', suffixes=['','_tmp'])
.query('(FROM_tmp <= FROM & TO <= TO_tmp) | \
(FROM <= FROM_tmp <= TO) | \
(FROM <= TO_tmp <= TO)'
)
)
s1 = (new_df['FROM_tmp'].le(new_df['FROM']) &
new_df['TO'].le( new_df['TO_tmp'])
)
s2 = (new_df['FROM_tmp'].ge(new_df['FROM']) &
new_df['FROM_tmp'].le( new_df['TO'])
)
new_df['overlap'] = np.select((s1,s2),
(new_df['TO_tmp'] - new_df['FROM_tmp'],
new_df['TO'] - new_df['FROM_tmp']),
new_df['TO_tmp'] - new_df['FROM']
)
# output
(new_df.loc[new_df.groupby(['ID','FROM', 'TO'])
.overlap.idxmax()]
.drop(['FROM_tmp', 'TO_tmp', 'overlap'], axis=1)
)
Output:
ID FROM TO PYR STRAT
0 A1 0 2 0.25 TD3
1 A1 2 4 0.11 TD3
2 A2 0 2 0.05 J1
4 A2 2 4 0.00 J1
7 A2 4 6 0.50 J2
Upvotes: 1