Erin W
Erin W

Reputation: 49

Join two Pandas Dataframes keeping specific string in specific column

I have two Dataframes that I'm trying to merge/join/concatenate (I'm not sure the right term). I don't care about the index. ID is the unique identifier for each row. There are a LOT of columns of data (simplified here to A-D) but for each unique ID, the columns of data will be the same between Dataframes EXCEPT for the final QC columns.

I want to join the two Dataframes, such that when there are duplicate entries (as determined by a duplication in the ID column), any instance of most of the rows is kept (first or last) but the value kept for QC_1 and QC_2 is where there is actually a value (in this case I've used the string 'Fail' but I could switch to a Bool and keep True if that makes this any easier).

I've tried iterations of .merge and .join. The closest I've gotten is with .concat or .append but then I can't figure out how to have the duplicated rows combined into one. Essentially, I'm at a loss.

df1

Index  ID    A     B     C     D    QC_1  
   3    13   10    15    17    100   Fail 
   4    17   20    25    27    110   Fail 
   7    42   30    35    37    120   Fail  
   12   115  40    45    47    130   Fail 

df2

Index  ID    A     B     C     D    QC_2 
   2    6    11    16    18    101   Fail 
   4    17   20    25    27    110   Fail 
   7    42   30    35    37    120   Fail 
   13   152  41    46    48    131   Fail  

goal

Index  ID    A     B     C     D    QC_1  QC_2
   3    13   10    15    17    100   Fail   NaN
   4    17   20    25    27    110   Fail   Fail
   7    42   30    35    37    120   Fail   Fail
   12   115  40    45    47    130   Fail   NaN
   2    6    11    16    18    101   NaN   Fail 
   13   152  41    46    48    131   NaN   Fail 

Upvotes: 0

Views: 122

Answers (3)

jfaccioni
jfaccioni

Reputation: 7529

Try this:

cols = list(df1.columns[:-1])
pd.merge(df1, df2, on=cols, how='outer')

Here I assume that you want to compare using every column of df1 except for the last (QC_1). Adapt the cols variable to your need.

Upvotes: 0

Nathan e
Nathan e

Reputation: 81

This does the job.

import pandas as pd


df1 = pd.DataFrame({'ID':[13,17,42,115], 
                    'A':[10,20,30,40], 
                    'B':[15,25,35,45], 
                    'C': list(range(17,48,10)), 
                    'D':list(range(100, 131,10)),
                    'QC_1':4*['Fail']
                    })

df2 = pd.DataFrame({'ID':[6,17,42,152], 
                    'A':[11,20,30,41], 
                    'B':[16,25,35,46], 
                    'C': [18,27,37,48], 
                    'D':[101,110,120,131],
                    'QC_2':4*['Fail']
                    })

result = df1.merge(
    df2,
    how='outer', 
    left_on=['ID', 'A', 'B', 'C', 'D'], 
    right_on=['ID', 'A', 'B', 'C', 'D']
    )

Upvotes: 0

Henry Yik
Henry Yik

Reputation: 22503

Use combine_first:

print (df1.set_index("ID").combine_first(df2.set_index("ID")).reset_index())

    ID     A     B     C      D  Index  QC_1  QC_2
0    6  11.0  16.0  18.0  101.0    2.0   NaN  Fail
1   13  10.0  15.0  17.0  100.0    3.0  Fail   NaN
2   17  20.0  25.0  27.0  110.0    4.0  Fail  Fail
3   42  30.0  35.0  37.0  120.0    7.0  Fail  Fail
4  115  40.0  45.0  47.0  130.0   12.0  Fail   NaN
5  152  41.0  46.0  48.0  131.0   13.0   NaN  Fail

Upvotes: 3

Related Questions