Reputation: 19664
From this question, I have two dataframes and am looking to merge them so that I left join dfB onto dfA, replacing NaN values with non-NaN values wherever dfB has them.
That is,
>>> dfA
s_name geo zip date value
0 A zip 60601 2010 NaN # In the earlier question, this was None
1 B zip 60601 2010 NaN # rather than NaN, which was
2 C zip 60601 2010 NaN # a mistake.
3 D zip 60601 2010 NaN
>>> dfB
s_name geo zip date value
0 A zip 60601 2010 1.0
1 B zip 60601 2010 NaN
3 D zip 60601 2010 4.0
Merging them, I see:
>>> new = pd.merge(dfA, dfB, on=["s_name", "geo", "zip", "date"], how="left")
>>> new.head()
name geo zip date value_x value_y
0 A state 01 2009 NaN 1.0
1 B state 01 2010 NaN NaN
2 C state 01 2011 NaN NaN
3 D state 01 2012 NaN 4.0
4 E state 01 2013 NaN 5.0
I can't be sure value_y is always numeric and value_x is always NaN. What I want is a merged column, call it value, that takes whichever of the two is not NaN. I try this:
>>> new["value"] = new.apply(lambda r: r.value_x or r.value_y, axis=1)
>>> new.head()
name geo zip date value_x value_y value
0 A state 01 2009 NaN 1.0 NaN
1 B state 01 2010 NaN NaN NaN
2 C state 01 2011 NaN NaN NaN
3 D state 01 2012 NaN 4.0 NaN
4 E state 01 2013 NaN 5.0 NaN
Oh no.
In hindsight this makes sense: NaN is truthy, so r.value_x or r.value_y short-circuits and returns the NaN itself. But it is not what I'm looking for. I'd like logic that returns whichever value is present, not NaN whenever one of the two is missing.
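A quick check in plain Python shows the difference between NaN and None here:
>>> nan = float("nan")
>>> bool(nan)   # NaN is truthy, so `or` never falls through to the second operand
True
>>> nan or 1.0
nan
>>> bool(None)  # None is falsy, so `or` does fall through
False
>>> None or 1.0
1.0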
I'd like the logic that None gives me. You can see:
>>> new["value_z"] = None
>>> new.head()
name geo zip date value_x value_y value value_z
0 A state 01 2009 NaN 1.0 NaN None
1 B state 01 2010 NaN NaN NaN None
2 C state 01 2011 NaN NaN NaN None
3 D state 01 2012 NaN 4.0 NaN None
4 E state 01 2013 NaN 5.0 NaN None
>>> new["value2"] = new.apply(lambda r: r.value_z or r.value_y, axis=1)
>>> new.head()
name geo zip date value_x value_y value value_z value2
0 A state 01 2009 NaN 1.0 NaN None 1.0
1 B state 01 2010 NaN NaN NaN None NaN
2 C state 01 2011 NaN NaN NaN None NaN
3 D state 01 2012 NaN 4.0 NaN None 4.0
4 E state 01 2013 NaN 5.0 NaN None 5.0
The logic that creates value2 is the behavior I'm looking for, not the logic that creates value.
What's the best way to do this?
Upvotes: 2
Views: 1268
Reputation: 75080
If you have a preference for value_x, you could try:
df.value_x = df.value_x.fillna(df.value_y)
df.pop('value_y')
or:
df.value_x = df.value_x.fillna(df.pop('value_y'))
>>> df
name geo zip date value_x
0 A state 1 2009 1.0
1 B state 1 2010 NaN
2 C state 1 2011 NaN
3 D state 1 2012 4.0
4 E state 1 2013 5.0
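The one-liner works because pop returns the column it removes, and fillna with a Series fills element-wise by aligned index. A minimal sketch on a made-up frame (toy is just illustrative data, not the question's frame):
>>> import pandas as pd, numpy as np
>>> toy = pd.DataFrame({"value_x": [np.nan, 2.0, np.nan],
...                     "value_y": [1.0, np.nan, 3.0]})
>>> toy.value_x = toy.value_x.fillna(toy.pop('value_y'))  # pop drops value_y and returns it
>>> toy
   value_x
0      1.0
1      2.0
2      3.0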
Upvotes: 4
Reputation: 402483
combine_first will work after merge:
dfC = pd.merge(dfA, dfB, on=["s_name", "geo", "zip", "date"], how="left")
dfC['value'] = dfC.pop('value_x').combine_first(dfC.pop('value_y'))
dfC
s_name geo zip date value
0 A zip 60601 2010 1.0
1 B zip 60601 2010 NaN
2 C zip 60601 2010 NaN
3 D zip 60601 2010 4.0
combine_first gives preference to "value_x" over "value_y". You can also write this as:
dfC = pd.merge(dfA, dfB, on=["s_name", "geo", "zip", "date"], how="left")
dfC['value_x'] = dfC['value_x'].combine_first(dfC.pop('value_y'))
dfC
s_name geo zip date value_x
0 A zip 60601 2010 1.0
1 B zip 60601 2010 NaN
2 C zip 60601 2010 NaN
3 D zip 60601 2010 4.0
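To see that precedence in isolation, here is a small sketch on two made-up Series: the caller's non-null values always win, and the argument is only consulted where the caller is NaN.
>>> import pandas as pd, numpy as np
>>> x = pd.Series([np.nan, 2.0, np.nan])
>>> y = pd.Series([1.0, 9.0, np.nan])
>>> x.combine_first(y)
0    1.0
1    2.0
2    NaN
dtype: float64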
Upvotes: 3
Reputation: 19664
This technically works by hammering out the logic, but is ugly and feels like a hack (and it only behaves because value_x is always NaN in my data; if value_x were ever non-NaN, not(pd.isna(r.value_x)) would short-circuit the expression to True rather than to the value itself):
>>> new["value3"] = new.apply(lambda r: (not(pd.isna(r.value_x)) or r.value_y) or (r.value_x or not(pd.isna(r.value_y))), axis=1)
>>> new.head()
name geo zip date value_x value_y value value_z value2 value3
0 A state 01 2009 NaN 1.0 NaN None 1.0 1.0
1 B state 01 2010 NaN NaN NaN None NaN NaN
2 C state 01 2011 NaN NaN NaN None NaN NaN
3 D state 01 2012 NaN 4.0 NaN None 4.0 4.0
4 E state 01 2013 NaN 5.0 NaN None 5.0 5.0
Upvotes: 0