Reputation: 45
I have two dataframes with various columns (sometimes the same, sometimes mostly the same--could have extra columns). I know this can be done with SQL or something else but I want to try to do it with Python. Also it isn't only one ID this is just an example of one of the IDs in the data.
Rules are: we trust df1 over df2 but if df1 is missing for a run of 20 (420-440) then I want df2 to override the -99 (missing) in df1. I don't want any override in df1 if its not 20 feet. it has to be 20 ft like this though (400-405, 405-410, 410-415, 415-420) so if there is a way to state what "origin" it starts from on the From/To that would be helpful.
df1
ID | From | To | Q | RM | RQ |
---|---|---|---|---|---|
MRC-10 | 0 | 5 | 0.005 | 15.0 | 10 |
MRC-10 | 5 | 10 | 0.5 | 1.8 | 10 |
MRC-10 | 10 | 15 | 0.21 | 10.7 | 20 |
MRC-10 | 15 | 20 | 0.20 | -99.0 | 10 |
MRC-17 | 400 | 405 | 0.01 | -99.0 | 10 |
MRC-17 | 405 | 410 | 0.15 | -99.0 | 10 |
MRC-17 | 410 | 415 | 0.21 | 0.8 | 20 |
MRC-17 | 415 | 420 | 0.20 | -99.0 | 10 |
MRC-17 | 420 | 425 | -99.0 | -99.0 | 10 |
MRC-17 | 425 | 430 | -99.0 | -99.0 | 10 |
MRC-17 | 430 | 435 | -99.0 | -99.0 | 10 |
MRC-17 | 435 | 440 | -99.0 | -99.0 | 26.67 |
MRC-17 | 440 | 445 | 0.14 | -99.0 | 10 |
MRC-17 | 445 | 450 | -99.0 | 0.04 | 48.57 |
MRC-15 | 100 | 105 | 10.1 | -99.0 | 10 |
MRC-15 | 105 | 110 | 1.5 | -99.0 | 10 |
MRC-15 | 110 | 115 | 8.8 | 0.8 | 20 |
MRC-15 | 115 | 120 | 7.3 | -99.0 | 10 |
MRC-15 | 120 | 125 | -99.0 | -99.0 | 10 |
MRC-15 | 125 | 130 | -99.0 | -99.0 | 10 |
MRC-15 | 130 | 135 | -99.0 | -99.0 | 10 |
MRC-15 | 135 | 140 | -99.0 | -99.0 | 26.67 |
MRC-15 | 140 | 145 | 15.4 | -99.0 | 10 |
MRC-15 | 145 | 150 | -99.0 | 0.04 | 48.57 |
df2
ID | From | To | Q | RM | RQ |
---|---|---|---|---|---|
MRC-22 | 0 | 150 | 0.12 | 0.75 | 10 |
MRC-22 | 150 | 180 | 0.07 | 0.15 | 18 |
MRC-23 | 0 | 55 | 0.04 | 0.16 | 20 |
MRC-17 | 400 | 420 | 0.15 | 0.01 | 10 |
MRC-17 | 420 | 440 | 0.33 | 0.8 | -99.0 |
MRC-17 | 440 | 460 | 0.21 | 0.8 | 20 |
MRC-15 | 100 | 120 | 0.52 | 0.7 | 10 |
MRC-15 | 120 | 140 | 0.64 | 0.15 | 10 |
MRC-15 | 140 | 160 | 0.88 | 0.82 | 20 |
Resulting in Final (-99 means missing for numeric, X for char):
ID | From | To | Q | RM | RQ |
---|---|---|---|---|---|
MRC-10 | 0 | 5 | 0.005 | 15.0 | 10 |
MRC-10 | 5 | 10 | 0.5 | 1.8 | 10 |
MRC-10 | 10 | 15 | 0.21 | 10.7 | 20 |
MRC-10 | 15 | 20 | 0.20 | -99.0 | 10 |
MRC-15 | 100 | 105 | 10.1 | -99.0 | 10 |
MRC-15 | 105 | 110 | 1.5 | -99.0 | 10 |
MRC-15 | 110 | 115 | 8.8 | 0.8 | 20 |
MRC-15 | 115 | 120 | 7.3 | -99.0 | 10 |
MRC-15 | 120 | 125 | 0.64 | 0.15 | 10 |
MRC-15 | 125 | 130 | 0.64 | 0.15 | 10 |
MRC-15 | 130 | 135 | 0.64 | 0.15 | 10 |
MRC-15 | 135 | 140 | 0.64 | 0.15 | 26.67 |
MRC-15 | 140 | 145 | 15.4 | -99.0 | 10 |
MRC-15 | 145 | 150 | -99.0 | 0.04 | 48.57 |
MRC-17 | 400 | 405 | 0.01 | -99.0 | 10 |
MRC-17 | 405 | 410 | 0.15 | -99.0 | 10 |
MRC-17 | 410 | 415 | 0.21 | 0.8 | 20 |
MRC-17 | 415 | 420 | 0.20 | -99.0 | 10 |
MRC-17 | 420 | 440 | 0.33 | 0.8 | 10 |
MRC-17 | 425 | 430 | 0.33 | 0.8 | 10 |
MRC-17 | 430 | 435 | 0.33 | 0.8 | 10 |
MRC-17 | 435 | 440 | 0.33 | 0.8 | 26.67 |
MRC-17 | 440 | 445 | 0.14 | -99.0 | 10 |
MRC-17 | 445 | 450 | -99.0 | 0.04 | 48.57 |
MRC-22 | 0 | 150 | 0.12 | 0.75 | 10 |
MRC-22 | 150 | 180 | 0.07 | 0.15 | 18 |
MRC-23 | 0 | 55 | 0.04 | 0.16 | 20 |
All I have is the below so far that I want to share, the rest I've tried isn't great and didn't work (got blank for the result). I think maybe I have to use mask since I used that before in something else (with the help of someone on here) but I'm unsure how to make this work.
# Load libraries
import pandas as pd
import numpy as np
from scipy import stats
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
Any help is greatly appreciated!
Upvotes: 0
Views: 87
Reputation: 15273
A fairly straightforward method that does a group and merge per column of interest:
from io import StringIO
import numpy as np
import pandas as pd
s1 = StringIO('''ID From To Q RM RQ
MRC-17 400 405 0.01 -99.0 10
MRC-17 405 410 0.15 -99.0 10
MRC-17 410 415 0.21 0.8 20
MRC-17 415 420 0.20 -99.0 10
MRC-17 420 425 -99.0 -99.0 10
MRC-17 425 430 -99.0 -99.0 10
MRC-17 430 435 -99.0 -99.0 10
MRC-17 435 440 -99.0 -99.0 26.67
MRC-17 440 445 0.14 -99.0 10
MRC-17 445 450 -99.0 0.04 48.57
''')
s2 = StringIO('''ID From To Q RM RQ
MRC-17 400 420 0.15 0.01 10
MRC-17 420 440 0.33 0.8 -99.0
MRC-17 440 460 0.21 0.8 20
''')
df1 = pd.read_csv(s1, delim_whitespace=True)
df2 = pd.read_csv(s2, delim_whitespace=True)
'''
we trust df1 over df2
if df1 is missing for a run of 20 (e.g. 420-440) then df2 overrides the -99 (missing) in df1
'''
for df in (df1, df2):
df.replace(-99, np.nan, inplace=True)
df1['From20'] = (df1.From // 20) * 20
grouped = df1.groupby('From20')
for col in ('Q', 'RM'):
needs_filling = ~grouped[col].any()
idx = pd.Series(needs_filling.index[needs_filling], name='From')
right = pd.merge(
left=idx, right=df2[['From', col]],
left_on='From', right_on='From',
)
merged = pd.merge(
how='left',
left=df1.From20, right=right,
left_on='From20', right_on='From',
)
df1[col].fillna(merged[col], inplace=True)
print(df1)
ID From To Q RM RQ From20
0 MRC-17 400 405 0.01 NaN 10.00 400
1 MRC-17 405 410 0.15 NaN 10.00 400
2 MRC-17 410 415 0.21 0.80 20.00 400
3 MRC-17 415 420 0.20 NaN 10.00 400
4 MRC-17 420 425 0.33 0.80 10.00 420
5 MRC-17 425 430 0.33 0.80 10.00 420
6 MRC-17 430 435 0.33 0.80 10.00 420
7 MRC-17 435 440 0.33 0.80 26.67 420
8 MRC-17 440 445 0.14 NaN 10.00 440
9 MRC-17 445 450 NaN 0.04 48.57 440
Upvotes: 1