aensm
aensm

Reputation: 3607

pandas - merging with missing values

There appears to be a quirk with the pandas merge function. It considers NaN values to be equal, and will merge NaNs with other NaNs:

>>> foo = DataFrame([
    ['a',1,2],
    ['b',4,5],
    ['c',7,8],
    [np.NaN,10,11]
], columns=['id','x','y'])

>>> bar = DataFrame([
    ['a',3],
    ['c',9],
    [np.NaN,12]
], columns=['id','z'])

>>> pd.merge(foo, bar, how='left', on='id')
Out[428]: 
    id   x   y   z
0    a   1   2   3
1    b   4   5 NaN
2    c   7   8   9
3  NaN  10  11  12

[4 rows x 4 columns]

This is unlike any RDB I've seen, normally missing values are treated with agnosticism and won't be merged together as if they are equal. This is especially problematic for datasets with sparse data (every NaN will be merged to every other NaN, resulting in a huge DataFrame!)

Is there a way to ignore missing values during a merge without first slicing them out?

Upvotes: 30

Views: 70359

Answers (4)

user3357177
user3357177

Reputation: 385

Another approach, which also keeps all rows if performing an outer join:

foo['id'] = foo.id.fillna('missing')
pd.merge(foo, bar, how='left', on='id')

Upvotes: 0

Lia
Lia

Reputation: 510

if do not need NaN in both left and right DF, use

pd.merge(foo.dropna(subset=['id']), bar.dropna(subset=['id']), how='left', on='id')

else if need NaN in left DF, use

pd.merge(foo, bar.dropna(subset=['id']), how='left', on='id')

Upvotes: 5

Yonas Kassa
Yonas Kassa

Reputation: 3710

If You want to preserve the NaNs from both tables without slicing them out, you could use the outer join method as follows:

pd.merge(foo, bar.dropna(subset=['id']), how='outer', on='id')

It basically returns the union of foo and bar

Upvotes: 7

meloncholy
meloncholy

Reputation: 2192

You could exclude values from bar (and indeed foo if you wanted) where id is null during the merge. Not sure it's what you're after, though, as they are sliced out.

(I've assumed from your left join that you're interested in retaining all of foo, but only want to merge the parts of bar that match and are not null.)

foo.merge(bar[pd.notnull(bar.id)], how='left', on='id')

Out[11]: 
id   x   y   z
0    a   1   2   3
1    b   4   5 NaN
2    c   7   8   9
3  NaN  10  11 NaN

Upvotes: 13

Related Questions