Nico
Nico

Reputation: 45

How do I merge two data frames if a column and it's values exist in both data frames?

How do I merge two data frames if a column has values in both data frames?

Data frame 1 – df1

    A   B   C
0   'name1' 'foo'   'bar'
1   'name2' 'foo''  'bar'
2   'name3' 'foo'   NaN
3   'name4' 'foo'   NaN

Data frame 2 – df2

    A   C
0   'name3' 'bar'
1   'name4' 'bar'

I tried this:

pd.merge(df1, df2, how='left', on='A')

But that does not create the desired output.

Here is a full minimum example:

import pandas as pd
from io import StringIO
csv1 = """A,B,C
'name1','foo','bar'
'name2','foo'','bar'
'name3','foo',
'name4','foo',
"""

csv2 = """A,C
'name3','bar'
'name4','bar'
"""
df1 = pd.read_csv(StringIO(csv1), header=0)
df2 = pd.read_csv(StringIO(csv2), header=0)

pd.merge(df1, df2, how='left', on='A')

I want this:

    A   B   C
0   'name1' 'foo'   'bar'
1   'name2' 'foo''  'bar'
2   'name3' 'foo'   'bar'
3   'name4' 'foo'   'bar'

But get this:

    A   B   C_x C_y
0   'name1' 'foo'   'bar'   NaN
1   'name2' 'foo''  'bar'   NaN
2   'name3' 'foo'   NaN 'bar'
3   'name4' 'foo'   NaN 'bar'

Upvotes: 3

Views: 99

Answers (3)

prosti
prosti

Reputation: 46291

I tried like this if you need to use merge:

df3= pd.merge(df, df2, how='left',  on=['A','C'])
print(df3)

         A       B      C
0  'name1'   'foo'  'bar'
1  'name2'  'foo''  'bar'
2  'name3'   'foo'    NaN
3  'name4'   'foo'    NaN
         A      C
0  'name3'  'bar'
1  'name4'  'bar'
         A       B      C
0  'name1'   'foo'  'bar'
1  'name2'  'foo''  'bar'
2  'name3'   'foo'    NaN
3  'name4'   'foo'    NaN

Other answers are much better, but definitively you need to use inner join (default for merge), and possible to ffill the C column, which would be simple as df3.C = df3.C.ffill()

Upvotes: 0

Erfan
Erfan

Reputation: 42886

If the real case data doesn't have the NaN on the same location, combine_first might fail, in that case you can use pd.merge and Series.fillna:

df = pd.merge(df1, df2, how='left', on='A', suffixes=['', '_2'])
df['C'].fillna(df['C_2'], inplace=True)
df.drop('C_2', axis=1, inplace=True)

         A      B      C
0  'name1'  'foo'  'bar'
1  'name2'  'foo'  'bar'
2  'name3'  'foo'  'bar'
3  'name4'  'foo'  'bar'

Upvotes: 2

anky
anky

Reputation: 75080

IIUC, you can do df.combine_first() which

Updates null elements with value in the same location in other.

df1.set_index('A').combine_first(df2.set_index('A')).reset_index()

        A       B      C
0  'name1'   'foo'  'bar'
1  'name2'  'foo''  'bar'
2  'name3'   'foo'  'bar'
3  'name4'   'foo'  'bar'

Upvotes: 5

Related Questions