Reputation: 49
I have two Dataframes that I'm trying to merge/join/concatenate (I'm not sure the right term). I don't care about the index. ID is the unique identifier for each row. There are a LOT of columns of data (simplified here to A-D) but for each unique ID, the columns of data will be the same between Dataframes EXCEPT for the final QC columns.
I want to join the two Dataframes, such that when there are duplicate entries (as determined by a duplication in the ID column), any instance of most of the rows is kept (first or last) but the value kept for QC_1 and QC_2 is where there is actually a value (in this case I've used the string 'Fail' but I could switch to a Bool and keep True if that makes this any easier).
I've tried iterations of .merge and .join. The closest I've gotten is with .concat or .append but then I can't figure out how to have the duplicated rows combined into one. Essentially, I'm at a loss.
df1
Index ID A B C D QC_1
3 13 10 15 17 100 Fail
4 17 20 25 27 110 Fail
7 42 30 35 37 120 Fail
12 115 40 45 47 130 Fail
df2
Index ID A B C D QC_2
2 6 11 16 18 101 Fail
4 17 20 25 27 110 Fail
7 42 30 35 37 120 Fail
13 152 41 46 48 131 Fail
goal
Index ID A B C D QC_1 QC_2
3 13 10 15 17 100 Fail NaN
4 17 20 25 27 110 Fail Fail
7 42 30 35 37 120 Fail Fail
12 115 40 45 47 130 Fail NaN
2 6 11 16 18 101 NaN Fail
13 152 41 46 48 131 NaN Fail
Upvotes: 0
Views: 122
Reputation: 7529
Try this:
cols = list(df1.columns[:-1])
pd.merge(df1, df2, on=cols, how='outer')
Here I assume that you want to compare using every column of df1
except for the last (QC_1
). Adapt the cols
variable to your need.
Upvotes: 0
Reputation: 81
This does the job.
import pandas as pd
df1 = pd.DataFrame({'ID':[13,17,42,115],
'A':[10,20,30,40],
'B':[15,25,35,45],
'C': list(range(17,48,10)),
'D':list(range(100, 131,10)),
'QC_1':4*['Fail']
})
df2 = pd.DataFrame({'ID':[6,17,42,152],
'A':[11,20,30,41],
'B':[16,25,35,46],
'C': [18,27,37,48],
'D':[101,110,120,131],
'QC_2':4*['Fail']
})
result = df1.merge(
df2,
how='outer',
left_on=['ID', 'A', 'B', 'C', 'D'],
right_on=['ID', 'A', 'B', 'C', 'D']
)
Upvotes: 0
Reputation: 22503
Use combine_first
:
print (df1.set_index("ID").combine_first(df2.set_index("ID")).reset_index())
ID A B C D Index QC_1 QC_2
0 6 11.0 16.0 18.0 101.0 2.0 NaN Fail
1 13 10.0 15.0 17.0 100.0 3.0 Fail NaN
2 17 20.0 25.0 27.0 110.0 4.0 Fail Fail
3 42 30.0 35.0 37.0 120.0 7.0 Fail Fail
4 115 40.0 45.0 47.0 130.0 12.0 Fail NaN
5 152 41.0 46.0 48.0 131.0 13.0 NaN Fail
Upvotes: 3