TomNash

Reputation: 3288

Merge two dataframes on multiple columns but only merge on columns if both not NaN

I'm looking to merge two dataframes across multiple columns but with some additional conditions.

import pandas as pd
df1 = pd.DataFrame({
    'col1': ['a','b','c', 'd'],
    'optional_col2': ['X',None,'Z','V'],
    'optional_col3': [None,'def', 'ghi','jkl']
})

df2 = pd.DataFrame({
    'col1': ['a','b','c', 'd'],
    'optional_col2': ['X','Y','Z','W'],
    'optional_col3': ['abc', 'def', 'ghi','mno']
})

I would like to always join on col1 but then try to also join on optional_col2 and optional_col3. In df1, the value can be NaN for both columns but it is always populated in df2. I would like the join to be valid when the col1 + one of optional_col2 or optional_col3 match.

This would result in rows ['a', 'b', 'c'] joining: 'a' matches on optional_col2, 'b' matches on optional_col3, and 'c' matches on both.

In SQL I suppose you could write the join as this, if it helps explain further:

select
    *
from
    df1
        inner join
    df2
        on df1.col1 = df2.col1
        AND (df1.optional_col2 = df2.optional_col2 OR df1.optional_col3 = df2.optional_col3)

I've messed around with pd.merge but can't figure out how to do a complex operation like this. I think I could do one merge on ['col1', 'optional_col2'], then a second merge on ['col1', 'optional_col3'], then union the results and drop duplicates?

Expected DataFrame would be something like:

merged_df = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'optional_col2': ['X', 'Y', 'Z'],
    'optional_col3': ['abc', 'def', 'ghi']
})

Upvotes: 1

Views: 201

Answers (1)

Riley

Reputation: 2261

This solution works by creating an extra column called "temp" in both dataframes. In df1 it is a column of True values. In df2 the value is True wherever at least one of the optional columns matches. I'm not clear whether you consider a NaN value to be matchable or not. If you do, you need to fill the NaNs in df1's columns with the values from df2 before comparing, which is what the code below does. If you don't, drop the fillna calls.

df1["temp"] = True
optional_col2_match = df1["optional_col2"].fillna(df2["optional_col2"]).eq(df2["optional_col2"])
optional_col3_match = df1["optional_col3"].fillna(df2["optional_col3"]).eq(df2["optional_col3"])
df2["temp"] = optional_col2_match | optional_col3_match

Then use the "temp" column in the merge and drop it afterwards, since it has served its purpose:

pd.merge(df1, df2, on=["col1", "temp"]).drop(columns="temp")

This gives the following result:

  col1 optional_col2_x optional_col3_x optional_col2_y optional_col3_y
0    a               X             abc               X             abc
1    b               Y             def               Y             def
2    c               Z             ghi               Z             ghi

You will need to decide which of the _x and _y columns to keep. In the example you gave there are no rows which match on just one of optional_col2 and optional_col3, which is why a 3 column solution looks reasonable. This won't generally be the case.
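
As a possible alternative, here is a minimal sketch that follows the SQL in the question more literally: merge on col1 only, then keep the rows where at least one optional column agrees. It differs from the code above in two ways that are assumptions on my part about what is wanted. First, a NaN in df1 is treated as not matching anything, as it would be in SQL. Second, the comparison is done on the rows paired up by the merge, so it does not rely on df1 and df2 sharing the same index. The names candidates, matched and result are just illustrative, and keeping df2's values for the final three columns is one choice among several.

```python
import pandas as pd

df1 = pd.DataFrame({
    "col1": ["a", "b", "c", "d"],
    "optional_col2": ["X", None, "Z", "V"],
    "optional_col3": [None, "def", "ghi", "jkl"],
})
df2 = pd.DataFrame({
    "col1": ["a", "b", "c", "d"],
    "optional_col2": ["X", "Y", "Z", "W"],
    "optional_col3": ["abc", "def", "ghi", "mno"],
})

# Pair rows on col1 only; the overlapping optional columns get _x/_y suffixes.
candidates = df1.merge(df2, on="col1", suffixes=("_x", "_y"))

# eq() returns False when either side is missing, so a NaN in df1
# cannot produce a match by itself (assumption: NaN is not matchable).
col2_match = candidates["optional_col2_x"].eq(candidates["optional_col2_y"])
col3_match = candidates["optional_col3_x"].eq(candidates["optional_col3_y"])

# Keep rows where at least one optional column agrees (the OR in the SQL).
matched = candidates[col2_match | col3_match]

# One possible way to get back to three columns: keep df2's values,
# since those are always populated.
result = (
    matched[["col1", "optional_col2_y", "optional_col3_y"]]
    .rename(columns={
        "optional_col2_y": "optional_col2",
        "optional_col3_y": "optional_col3",
    })
    .reset_index(drop=True)
)
print(result)
```

If col1 is not unique, the first merge produces every pairing of rows that share a col1 value, so the intermediate frame can be much larger than either input before the filter is applied.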

Upvotes: 2
