Reputation: 45
How do I merge two data frames if a column has values in both data frames?
Data frame 1 – df1
A B C
0 'name1' 'foo' 'bar'
1 'name2' 'foo'' 'bar'
2 'name3' 'foo' NaN
3 'name4' 'foo' NaN
Data frame 2 – df2
A C
0 'name3' 'bar'
1 'name4' 'bar'
I tried this:
pd.merge(df1, df2, how='left', on='A')
But that does not create the desired output.
Here is a full minimum example:
import pandas as pd
from io import StringIO
csv1 = """A,B,C
'name1','foo','bar'
'name2','foo'','bar'
'name3','foo',
'name4','foo',
"""
csv2 = """A,C
'name3','bar'
'name4','bar'
"""
df1 = pd.read_csv(StringIO(csv1), header=0)
df2 = pd.read_csv(StringIO(csv2), header=0)
pd.merge(df1, df2, how='left', on='A')
I want this:
A B C
0 'name1' 'foo' 'bar'
1 'name2' 'foo'' 'bar'
2 'name3' 'foo' 'bar'
3 'name4' 'foo' 'bar'
But get this:
A B C_x C_y
0 'name1' 'foo' 'bar' NaN
1 'name2' 'foo'' 'bar' NaN
2 'name3' 'foo' NaN 'bar'
3 'name4' 'foo' NaN 'bar'
Upvotes: 3
Views: 99
Reputation: 46291
I tried like this if you need to use merge
:
df3= pd.merge(df, df2, how='left', on=['A','C'])
print(df3)
A B C
0 'name1' 'foo' 'bar'
1 'name2' 'foo'' 'bar'
2 'name3' 'foo' NaN
3 'name4' 'foo' NaN
A C
0 'name3' 'bar'
1 'name4' 'bar'
A B C
0 'name1' 'foo' 'bar'
1 'name2' 'foo'' 'bar'
2 'name3' 'foo' NaN
3 'name4' 'foo' NaN
Other answers are much better, but definitively you need to use inner join (default for merge), and possible to ffill
the C column, which would be simple as df3.C = df3.C.ffill()
Upvotes: 0
Reputation: 42886
If the real case data doesn't have the NaN
on the same location, combine_first
might fail, in that case you can use pd.merge
and Series.fillna
:
df = pd.merge(df1, df2, how='left', on='A', suffixes=['', '_2'])
df['C'].fillna(df['C_2'], inplace=True)
df.drop('C_2', axis=1, inplace=True)
A B C
0 'name1' 'foo' 'bar'
1 'name2' 'foo' 'bar'
2 'name3' 'foo' 'bar'
3 'name4' 'foo' 'bar'
Upvotes: 2
Reputation: 75080
IIUC, you can do df.combine_first()
which
Updates null elements with value in the same location in other.
df1.set_index('A').combine_first(df2.set_index('A')).reset_index()
A B C
0 'name1' 'foo' 'bar'
1 'name2' 'foo'' 'bar'
2 'name3' 'foo' 'bar'
3 'name4' 'foo' 'bar'
Upvotes: 5