Merge two dataframes on multiple columns but only merge on columns if both not NaN

Question

I'm looking to merge two dataframes across multiple columns but with some additional conditions.

import pandas as pd
df1 = pd.DataFrame({
    'col1': ['a','b','c', 'd'],
    'optional_col2': ['X',None,'Z','V'],
    'optional_col3': [None,'def', 'ghi','jkl']
})

df2 = pd.DataFrame({
    'col1': ['a','b','c', 'd'],
    'optional_col2': ['X','Y','Z','W'],
    'optional_col3': ['abc', 'def', 'ghi','mno']
})

I always want to join on col1, and additionally on optional_col2 and optional_col3 where possible. In df1 either optional column can be NaN, but in df2 both are always populated. A join should be valid when col1 matches and at least one of optional_col2 or optional_col3 also matches.

With the data above, rows 'a', 'b', and 'c' would join: 'a' matches on optional_col2, 'b' matches on optional_col3, and 'c' matches on both. Row 'd' matches on neither and is dropped.

In SQL I suppose you could write the join like this, if it helps explain further:

select
    *
from
    df1
        inner join
    df2
        on df1.col1 = df2.col1
        AND (df1.optional_col2 = df2.optional_col2 OR df1.optional_col3 = df2.optional_col3)
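
For reference, a rough pandas equivalent of that SQL might look like the sketch below: merge on col1 only, then keep the rows where at least one optional column agrees. The suffixes '_df1' and '_df2' and the names joined, mask, and result are just illustrative choices, not anything pandas requires. It assumes col1 is reasonably selective, since every col1 match is materialised before filtering.

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl'],
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno'],
})

# Step 1: join on the mandatory key only; overlapping columns get suffixes.
joined = df1.merge(df2, on='col1', how='inner', suffixes=('_df1', '_df2'))

# Step 2: the OR condition from the SQL. A missing value on the df1 side
# compares unequal to any df2 value, so it never produces a match.
mask = (
    (joined['optional_col2_df1'] == joined['optional_col2_df2'])
    | (joined['optional_col3_df1'] == joined['optional_col3_df2'])
)

# Step 3: keep df2's (always populated) values and restore the plain names.
result = (
    joined.loc[mask, ['col1', 'optional_col2_df2', 'optional_col3_df2']]
    .rename(columns={
        'optional_col2_df2': 'optional_col2',
        'optional_col3_df2': 'optional_col3',
    })
    .reset_index(drop=True)
)
print(result)
```

On the sample data this keeps rows 'a', 'b', and 'c' and drops 'd'.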

I've messed around with pd.merge but can't figure out how to do a complex operation like this. I think I can do one merge on ['col1', 'optional_col2'], then a second merge on ['col1', 'optional_col3'], then union the results and drop duplicates?
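
A minimal sketch of that two-merge idea, in case it helps frame the question. The intermediate names keys2, keys3, m2, and m3 are made up for illustration. The dropna() calls are a defensive assumption: pandas matches null keys against each other in a merge, so dropping df1's missing keys first avoids accidental matches if df2 ever did contain nulls.

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl'],
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno'],
})

# Merge 1: col1 + optional_col2, using only df1 rows where the key is present.
keys2 = df1[['col1', 'optional_col2']].dropna()
m2 = keys2.merge(df2, on=['col1', 'optional_col2'], how='inner')

# Merge 2: col1 + optional_col3, same idea.
keys3 = df1[['col1', 'optional_col3']].dropna()
m3 = keys3.merge(df2, on=['col1', 'optional_col3'], how='inner')

# Union the two results (concat aligns columns by name), then drop rows
# that matched on both optional columns and therefore appear twice.
merged_df = (
    pd.concat([m2, m3], ignore_index=True)
    .drop_duplicates()
    .sort_values('col1')
    .reset_index(drop=True)
)
print(merged_df)
```

Because only the key columns are taken from df1, the non-key values in the output come from df2, which lines up with the expected frame where 'b' carries 'Y' rather than NaN.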

Expected DataFrame would be something like:

merged_df = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'optional_col2': ['X', 'Y', 'Z'],
    'optional_col3': ['abc', 'def', 'ghi']
})

