Kenan

Reputation: 14094

Pandas merge dataframes with shared column, fillna in left with right

I am trying to merge two dataframes and fill the NaNs in the left df with values from the right df. I can do it in three lines of code, as below, but is there a better/shorter way?

import numpy as np
import pandas as pd

# Example data (my actual df is ~500k rows x 11 cols)
df1 = pd.DataFrame({'a': [1,2,3,4], 'b': [0,1,np.nan, 1], 'e': ['a', 1, 2,'b']})
df2 = pd.DataFrame({'a': [1,2,3,4], 'b': [np.nan, 1, 0, 1]})

# Merge the dataframes...
df = df1.merge(df2, on='a', how='left')

# Fillna in 'b' column of left df with right df...
df['b'] = df['b_x'].fillna(df['b_y'])

# Drop the columns no longer needed
df = df.drop(['b_x', 'b_y'], axis=1)

Update 2023

For a similar update that also overwrites values in df1 that are not NaN (wherever df2 has a non-NaN value):

df1 = df1.set_index('a')
df2 = df2.set_index('a')
df1.update(df2)
df1.reset_index(inplace=True)
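
To make the difference between update and fillna concrete, here is a minimal sketch. The frames `left` and `right` are made-up data (not from the question), chosen so that one row has differing non-NaN values on both sides:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row a=2 has a non-NaN value on both sides that differs.
left = pd.DataFrame({'a': [1, 2, 3], 'b': [0.0, 1.0, np.nan]}).set_index('a')
right = pd.DataFrame({'a': [1, 2, 3], 'b': [np.nan, 9.0, 5.0]}).set_index('a')

# fillna only touches the NaN cells of `left`.
filled = left.fillna(right)

# update overwrites every cell of `left` for which `right` is non-NaN.
updated = left.copy()
updated.update(right)

print(filled.reset_index())
print(updated.reset_index())
```

For a=2, fillna keeps the 1.0 from the left frame, while update replaces it with the 9.0 from the right frame. Which one you want depends on which dataframe should win when both have a value.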

Upvotes: 20

Views: 53608

Answers (5)

Carl F. Corneil

Reputation: 192

These answers didn't work for me in pandas 1.5.3, but by manipulating the suffixes a bit I got this to work:

df1.fillna(df1.merge(df2, how="left", on="a", suffixes=["_old", ""]))

This results in this output:

    a   b   e
0   1   0.0 a
1   2   1.0 1
2   3   0.0 2
3   4   1.0 b
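
A sketch of why the suffix trick works, split into two steps. It uses the data from the question; the intermediate name `merged` is only for illustration:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [0, 1, np.nan, 1], 'e': ['a', 1, 2, 'b']})
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 1, 0, 1]})

# With suffixes ("_old", ""), the left 'b' becomes 'b_old' and the
# right 'b' keeps the plain name 'b'.
merged = df1.merge(df2, how='left', on='a', suffixes=('_old', ''))
print(merged.columns.tolist())

# fillna aligns on column names and index, so only merged['b']
# (the values from df2) is used to fill the NaNs in df1['b'].
result = df1.fillna(merged)
print(result)
```

The extra 'b_old' column is simply ignored by fillna, because df1 has no column with that name.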

Upvotes: 1

BENY

Reputation: 323226

The data

df1 = pd.DataFrame({'a': [1,2,3,4], 'b': [0,1,np.nan, 1], 'e': ['a', 1, 2,'b']})
df2 = pd.DataFrame({'a': [1,2,3,4], 'b': [np.nan, 1, 0, 1]})

Short version

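# assign back instead of inplace=True on a column selection,
# which does not reliably modify df1 in recent pandas versions
df1['b'] = df1['b'].fillna(df1['a'].map(df2.set_index('a')['b']))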
df1
Out[173]: 
   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b
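
The map-based lookup does not depend on row order. A minimal sketch, assuming the key values in the right dataframe are unique; `df2_shuffled` is a hypothetical reordering of df2:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [0, 1, np.nan, 1]})
# Hypothetical variant of df2 with its rows in a different order.
df2_shuffled = pd.DataFrame({'a': [4, 3, 2, 1], 'b': [1, 0, 1, np.nan]})

# Build a lookup Series: the index is the key 'a', the values are df2's 'b'.
lookup = df2_shuffled.set_index('a')['b']

# map() looks up each key of df1['a'] in that Series, so the result
# is ordered like df1 regardless of the row order in df2.
candidates = df1['a'].map(lookup)

df1['b'] = df1['b'].fillna(candidates)
print(df1)
```

Keys of df1 that do not occur in the lookup simply map to NaN, so those rows are left unfilled.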

Since you mentioned there will be multiple columns

df = df1.combine_first(df1[['a']].merge(df2, on='a', how='left'))
df
Out[184]: 
   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b

We can also pass a dataframe to fillna

df1.fillna(df1[['a']].merge(df2, on='a', how='left'))
Out[185]: 
   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b

Upvotes: 6

piRSquared

Reputation: 294228

merge gets in the way here because both dataframes have a 'b' column, and the left and right versions have NaNs in different places. You want to avoid getting the unwanted duplicate columns 'b_x' and 'b_y' from merge in the first place:

  • slice the key column 'a' and the non-shared column 'e' from df1
  • do merge(df2, 'left'); this picks up 'b' from the right dataframe (since after slicing it only exists in the right df)
  • finally do df1.update(...); this overwrites df1['b'] in place with the non-NaN values of 'b' taken from df2, and leaves df1 untouched wherever df2 has NaN

Solution:

df1.update(df1[['a', 'e']].merge(df2, 'left'))

df1

   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b
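
The one-liner above can be unpacked into its three steps. This is only an illustrative sketch with the question's data; `left_part` and `merged` are names introduced here:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [0, 1, np.nan, 1], 'e': ['a', 1, 2, 'b']})
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 1, 0, 1]})

# Step 1: keep only the columns of df1 that must not collide with df2.
left_part = df1[['a', 'e']]

# Step 2: the left merge brings in 'b' from df2 only, so no b_x / b_y.
merged = left_part.merge(df2, how='left', on='a')
print(merged.columns.tolist())

# Step 3: update df1 in place with the non-NaN cells of `merged`.
df1.update(merged)
print(df1)
```

Note that update would also replace a non-NaN value in df1['b'] if df2 had a different non-NaN value for the same row; with this data the two frames agree wherever both have a value.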

Note: Because I used merge(..., how='left'), I preserve the row order of the calling dataframe. If my df1 had values of a that were not in order

   a    b  e
0  1  0.0  a
1  2  1.0  1
2  4  1.0  b
3  3  NaN  2

The result would be

df1.update(df1[['a', 'e']].merge(df2, 'left'))

df1

   a    b  e
0  1  0.0  a
1  2  1.0  1
2  4  1.0  b
3  3  0.0  2

Which is as expected.


Further...

If you want to be more explicit when there may be more columns involved

df1.update(df1.drop(columns='b').merge(df2, how='left', on='a'))

Even Further...

If you don't want to modify df1 in place, you can use combine_first, which returns a new dataframe

Quick

df1.combine_first(df1[['a', 'e']].merge(df2, 'left'))

Explicit

df1.combine_first(df1.drop(columns='b').merge(df2, how='left', on='a'))

EVEN FURTHER!...

The 'left' merge may preserve order but NOT the index. This is the ultra conservative approach:

df3 = df1.drop(columns='b').merge(df2, how='left', on='a').set_index(df1.index)
df1.combine_first(df3)
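
A small sketch of why restoring the index matters. It assumes a df1 with a non-default index (the labels 10 to 13 are made up) and a merge that returns exactly one row per row of df1:

```python
import numpy as np
import pandas as pd

# Hypothetical df1 with a non-default index.
df1 = pd.DataFrame(
    {'a': [1, 2, 3, 4], 'b': [0, 1, np.nan, 1], 'e': ['a', 1, 2, 'b']},
    index=[10, 11, 12, 13],
)
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 1, 0, 1]})

merged = df1.drop(columns='b').merge(df2, how='left', on='a')
print(merged.index.tolist())   # merge returns a fresh 0..n-1 index

# Without restoring the index no rows line up, so the NaN in 'b'
# stays and the rows of `merged` are added as extra rows.
misaligned = df1.combine_first(merged)

# Restoring df1's index makes combine_first align row by row.
df3 = merged.set_index(df1.index)
result = df1.combine_first(df3)
print(result)
```

If df2 has duplicate keys, the left merge returns more rows than df1 has, and set_index(df1.index) fails with a length mismatch, so that case needs deduplication first.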

Upvotes: 17

krewsayder

Reputation: 446

You can mask the data.

original data:

print(df1)
   one  two  three
0    1  1.0    1.0
1    2  NaN    2.0
2    3  3.0    NaN

print(df2)
   one  two  three
0    4    4      4
1    4    2      4
2    4    4      3

See below: mask replaces values wherever the condition is True, taking them from the second argument.

# mask values where isna()
df1[['two','three']] = df1[['two','three']]\
        .mask(df1[['two','three']].isna(),df2[['two','three']])

output:

   one  two  three
0    1  1.0    1.0
1    2  2.0    2.0
2    3  3.0    3.0
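
A self-contained version of the same idea, with the frames rebuilt from the printed data above. The `cols` variable is introduced here only to avoid repeating the column list:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'one': [1, 2, 3], 'two': [1, np.nan, 3], 'three': [1, 2, np.nan]})
df2 = pd.DataFrame({'one': [4, 4, 4], 'two': [4, 2, 4], 'three': [4, 4, 3]})

cols = ['two', 'three']

# mask(cond, other) replaces cells where cond is True with `other`;
# here cond marks the NaN cells of df1.
df1[cols] = df1[cols].mask(df1[cols].isna(), df2[cols])
print(df1)
```

The values from df2 are matched by index label and column name, not by a key column, so this only gives the intended result when both frames have their rows in the same order with the same index.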

Upvotes: 2

Erfan

Reputation: 42886

Only if the indices are aligned (important note), we can use update:

# update through the dataframe rather than through a column selection
df1.update(df2[['b']])


   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b

Or simply fillna:

df1['b'] = df1['b'].fillna(df2['b'])

If your indices are not aligned, see WenNYoBen's answer or the comment underneath.
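
To illustrate what "aligned" means here, a minimal sketch with a hypothetical df2 whose rows are in reverse order (and a made-up value 7 for a=3 so the two results differ):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [0, 1, np.nan, 1]})
# Hypothetical df2 whose rows are in reverse order, so index labels
# no longer correspond to the same key 'a' in both frames.
df2 = pd.DataFrame({'a': [4, 3, 2, 1], 'b': [1, 7, 1, np.nan]})

# Series.fillna aligns on the index label: row 2 of df1 (a=3) is
# filled from row 2 of df2 (a=2), which is the wrong record.
by_position = df1['b'].fillna(df2['b'])

# Aligning on the key instead: index both frames by 'a' first.
by_key = df1.set_index('a')['b'].fillna(df2.set_index('a')['b'])

print(by_position.tolist())
print(by_key.tolist())
```

Only the second variant fills a=3 with the value that df2 actually holds for a=3.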

Upvotes: 3
