mlisthenewcool
mlisthenewcool

Reputation: 569

pandas: merge with a key in different columns (merge using col1 OR col2)

df_a and df_b are two dataframes that looks like following. They do NOT have same length.

--> df_a
col_1  col_2
a      b
c
d
       e
f      g

--> df_b
col_3
a
c
d
e

I know I can achieve a merge with AND clause with :

df_a = pd.Dataframe({"col_1": ["a", "c", "d", None, "f"], "col_2": ["b", None, None, "e", "g"]})
df_b = pd.Dataframe({"col_3": ["a", "c", "d", "e"]})

pd.merge(df_a, df_b, how="inner", left_on=["col_1", "col_2"], right_on=["col_3", "col_3"])

My need is to merged them based on a OR clause like col_1 == col_3 OR col_2 == col_3. Ideally I do not want to make two merges separately.

Appreciate any suggestions.

Upvotes: 2

Views: 265

Answers (1)

mozway
mozway

Reputation: 261974

Use a double left merge, first with 'col_1', then with 'col_2' and combine_first the two outputs:

(df_a
 .merge(df_b, left_on='col_1', right_on='col_3', how='left')
 .combine_first(df_a.merge(df_b, left_on='col_2', right_on='col_3', how='left'))
)

output:

  col_1 col_2 col_3  col_4
0     a     b     a    0.0
1     c  None     c    1.0
2     d  None     d    2.0
3  None     e     e    3.0
4     f     g   NaN    NaN

used input (with extra column):

df_a = pd.DataFrame({"col_1": ["a", "c", "d", None, "f"],
                     "col_2": ["b", None, None, "e", "g"]})
df_b = pd.DataFrame({"col_3": ["a", "c", "d", "e"], "col_4": range(4)})
arbitrary number of columns

you can use functools.reduce and a list of columns in order of the preferred match (modified input to see the difference, here 'col_2' has priority over 'col_1'):

cols = ['col_2', 'col_1']

from functools import reduce

out = reduce(lambda a,b: a.combine_first(b),
             [df_a.merge(df_b, left_on=c, right_on='col_3', how='left')
              for c in cols])

output:

  col_1 col_2 col_3  col_4
0     a     c     c    1.0
1     c  None     c    1.0
2     d  None     d    2.0
3  None     e     e    3.0
4     f     g   NaN    NaN

modified input:

df_a = pd.DataFrame({"col_1": ["a", "c", "d", None, "f"],
                     "col_2": ["c", None, None, "e", "g"]})
df_b = pd.DataFrame({"col_3": ["a", "c", "d", "e"], "col_4": range(4)})

Upvotes: 2

Related Questions