How to merge two dataframes with overlapping data and special requirements?

Question

I have two dataframes with various columns (sometimes the same, sometimes mostly the same, with a few extra columns). I know this can be done with SQL or another tool, but I want to do it in Python. Also, the data contains more than one ID; the IDs below are just examples.

The rules are: we trust df1 over df2, but if df1 is missing for a whole run of 20 ft (e.g. 420-440), then I want df2 to override the -99 (missing) in df1. I don't want any override in df1 if it's not 20 ft. The 20 ft has to line up like this, though (400-405, 405-410, 410-415, 415-420), so if there is a way to state what "origin" the From/To runs start from, that would be helpful.

df1

ID	From	To	Q	RM	RQ
MRC-10	0	5	0.005	15.0	10
MRC-10	5	10	0.5	1.8	10
MRC-10	10	15	0.21	10.7	20
MRC-10	15	20	0.20	-99.0	10
MRC-17	400	405	0.01	-99.0	10
MRC-17	405	410	0.15	-99.0	10
MRC-17	410	415	0.21	0.8	20
MRC-17	415	420	0.20	-99.0	10
MRC-17	420	425	-99.0	-99.0	10
MRC-17	425	430	-99.0	-99.0	10
MRC-17	430	435	-99.0	-99.0	10
MRC-17	435	440	-99.0	-99.0	26.67
MRC-17	440	445	0.14	-99.0	10
MRC-17	445	450	-99.0	0.04	48.57
MRC-15	100	105	10.1	-99.0	10
MRC-15	105	110	1.5	-99.0	10
MRC-15	110	115	8.8	0.8	20
MRC-15	115	120	7.3	-99.0	10
MRC-15	120	125	-99.0	-99.0	10
MRC-15	125	130	-99.0	-99.0	10
MRC-15	130	135	-99.0	-99.0	10
MRC-15	135	140	-99.0	-99.0	26.67
MRC-15	140	145	15.4	-99.0	10
MRC-15	145	150	-99.0	0.04	48.57

df2

ID	From	To	Q	RM	RQ
MRC-22	0	150	0.12	0.75	10
MRC-22	150	180	0.07	0.15	18
MRC-23	0	55	0.04	0.16	20
MRC-17	400	420	0.15	0.01	10
MRC-17	420	440	0.33	0.8	-99.0
MRC-17	440	460	0.21	0.8	20
MRC-15	100	120	0.52	0.7	10
MRC-15	120	140	0.64	0.15	10
MRC-15	140	160	0.88	0.82	20

Resulting in Final (-99 means missing for numeric, X for char):

ID	From	To	Q	RM	RQ
MRC-10	0	5	0.005	15.0	10
MRC-10	5	10	0.5	1.8	10
MRC-10	10	15	0.21	10.7	20
MRC-10	15	20	0.20	-99.0	10
MRC-15	100	105	10.1	-99.0	10
MRC-15	105	110	1.5	-99.0	10
MRC-15	110	115	8.8	0.8	20
MRC-15	115	120	7.3	-99.0	10
MRC-15	120	125	0.64	0.15	10
MRC-15	125	130	0.64	0.15	10
MRC-15	130	135	0.64	0.15	10
MRC-15	135	140	0.64	0.15	26.67
MRC-15	140	145	15.4	-99.0	10
MRC-15	145	150	-99.0	0.04	48.57
MRC-17	400	405	0.01	-99.0	10
MRC-17	405	410	0.15	-99.0	10
MRC-17	410	415	0.21	0.8	20
MRC-17	415	420	0.20	-99.0	10
MRC-17	420	425	0.33	0.8	10
MRC-17	425	430	0.33	0.8	10
MRC-17	430	435	0.33	0.8	10
MRC-17	435	440	0.33	0.8	26.67
MRC-17	440	445	0.14	-99.0	10
MRC-17	445	450	-99.0	0.04	48.57
MRC-22	0	150	0.12	0.75	10
MRC-22	150	180	0.07	0.15	18
MRC-23	0	55	0.04	0.16	20

Below is all I have so far. The rest of what I tried didn't work (the result came out blank). I think I may need to use mask, since I used it before for something else (with help from someone on here), but I'm unsure how to make it work here.

# Load libraries

import pandas as pd
import numpy as np
from scipy import stats

df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')

Any help is greatly appreciated!

Reinderien · Accepted Answer

A fairly straightforward method is to group df1 into 20 ft blocks and merge with df2 once per column of interest (shown here for a single ID):

from io import StringIO

import numpy as np
import pandas as pd

s1 = StringIO('''ID     From    To  Q   RM  RQ
MRC-17  400     405     0.01    -99.0   10
MRC-17  405     410     0.15    -99.0   10
MRC-17  410     415     0.21    0.8     20
MRC-17  415     420     0.20    -99.0   10
MRC-17  420     425     -99.0   -99.0   10
MRC-17  425     430     -99.0   -99.0   10
MRC-17  430     435     -99.0   -99.0   10
MRC-17  435     440     -99.0   -99.0   26.67
MRC-17  440     445     0.14    -99.0   10
MRC-17  445     450     -99.0   0.04    48.57
''')
s2 = StringIO('''ID     From    To  Q   RM  RQ
MRC-17  400     420     0.15    0.01    10
MRC-17  420     440     0.33    0.8     -99.0
MRC-17  440     460     0.21    0.8     20
''')

# sep=r'\s+' splits on any whitespace; delim_whitespace is deprecated in recent pandas
df1 = pd.read_csv(s1, sep=r'\s+')
df2 = pd.read_csv(s2, sep=r'\s+')

'''
we trust df1 over df2 
if df1 is missing for a run of 20 (e.g. 420-440) then df2 overrides the -99 (missing) in df1
'''

for df in (df1, df2):
    df.replace(-99, np.nan, inplace=True)

df1['From20'] = (df1.From // 20) * 20
grouped = df1.groupby('From20')
for col in ('Q', 'RM'):
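    # A block needs filling only if it has no valid (non-NaN) value at all;
    # count() is used so that a legitimate 0.0 is not mistaken for missing.
    needs_filling = grouped[col].count() == 0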
    idx = pd.Series(needs_filling.index[needs_filling], name='From')
    right = pd.merge(
        left=idx,       right=df2[['From', col]],
        left_on='From', right_on='From',
    )
    merged = pd.merge(
        how='left',
        left=df1.From20,  right=right,
        left_on='From20', right_on='From',
    )
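    # Assign back instead of an in-place fillna on a column selection,
    # which newer pandas versions do not apply to df1.
    df1[col] = df1[col].fillna(merged[col])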

print(df1)

       ID  From   To     Q    RM     RQ  From20
0  MRC-17   400  405  0.01   NaN  10.00     400
1  MRC-17   405  410  0.15   NaN  10.00     400
2  MRC-17   410  415  0.21  0.80  20.00     400
3  MRC-17   415  420  0.20   NaN  10.00     400
4  MRC-17   420  425  0.33  0.80  10.00     420
5  MRC-17   425  430  0.33  0.80  10.00     420
6  MRC-17   430  435  0.33  0.80  10.00     420
7  MRC-17   435  440  0.33  0.80  26.67     420
8  MRC-17   440  445  0.14   NaN  10.00     440
9  MRC-17   445  450   NaN  0.04  48.57     440
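
The example above uses a single ID. As a minimal sketch of how the same idea might be extended to several IDs and to a configurable "origin" for the 20 ft blocks, the following groups by both ID and block start. The names fill_missing_runs, origin, BLOCK and MISSING are hypothetical, not part of the original answer. It also assumes that only df2 rows spanning exactly 20 ft may override, and that each (ID, From) pair occurs at most once in df2.

```python
import numpy as np
import pandas as pd

MISSING = -99  # sentinel used for missing numeric values in the question
BLOCK = 20     # length in feet of a run that must be entirely missing


def fill_missing_runs(df1, df2, cols=('Q', 'RM'), origin=0):
    """Fill df1[col] from df2 only where a whole block of one ID is missing."""
    left = df1.replace(MISSING, np.nan)
    right = df2.replace(MISSING, np.nan)
    # Assumption: only df2 intervals spanning exactly one block may override.
    right = right[right['To'] - right['From'] == BLOCK]
    # Start of the block each df1 row falls into, measured from `origin`.
    left['Block'] = (left['From'] - origin) // BLOCK * BLOCK + origin
    keys = pd.MultiIndex.from_frame(left[['ID', 'Block']])
    for col in cols:
        # True on each row whose (ID, Block) group holds no valid value.
        all_missing = left.groupby(['ID', 'Block'])[col].transform('count').eq(0)
        # df2 value whose (ID, From) equals the row's (ID, Block), else NaN.
        lookup = right.set_index(['ID', 'From'])[col]
        candidate = pd.Series(lookup.reindex(keys).to_numpy(), index=left.index)
        left[col] = left[col].where(~all_missing, candidate)
    return left.drop(columns='Block')


# Small made-up sample in the same shape as the question's data.
df1 = pd.DataFrame({
    'ID': ['MRC-15'] * 4 + ['MRC-17'] * 4,
    'From': [120, 125, 130, 135, 440, 445, 450, 455],
    'To': [125, 130, 135, 140, 445, 450, 455, 460],
    'Q': [-99.0, -99.0, -99.0, -99.0, 0.14, -99.0, -99.0, -99.0],
    'RM': [-99.0, -99.0, -99.0, -99.0, -99.0, 0.04, -99.0, -99.0],
})
df2 = pd.DataFrame({
    'ID': ['MRC-15', 'MRC-17', 'MRC-22'],
    'From': [120, 440, 0],
    'To': [140, 460, 150],
    'Q': [0.64, 0.21, 0.12],
    'RM': [0.15, 0.8, 0.75],
})

filled = fill_missing_runs(df1, df2)
# Append df2 rows whose ID never occurs in df1, then restore the -99 sentinel.
extra = df2[~df2['ID'].isin(df1['ID'])]
final = (pd.concat([filled, extra], ignore_index=True)
         .sort_values(['ID', 'From'], ignore_index=True)
         .fillna(MISSING))
print(final)
```

In this sample the MRC-15 block 120-140 is entirely missing and so takes its values from df2, while the MRC-17 block 440-460 keeps its -99 entries because each column has at least one valid value in df1. If the blocks do not start at a multiple of 20, passing a different origin shifts where each block begins.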
