Mittenchops
Mittenchops

Reputation: 19664

Combine two columns while giving priority to the first one

From this question, I have two matrices and am looking to merge them in such a way that I left join dfB onto dfA replacing NaN values with non-NaN values wherever I have them.

That is,

>>> dfA
  s_name  geo    zip  date value
0      A  zip  60601  2010   NaN  # In the earlier question, this was None
1      B  zip  60601  2010   NaN  # rather than NaN, which was
2      C  zip  60601  2010   NaN  # a mistake.
3      D  zip  60601  2010   NaN

>>> dfB
  s_name  geo    zip  date  value
0      A  zip  60601  2010    1.0
1      B  zip  60601  2010    NaN
3      D  zip  60601  2010    4.0

Merging them, I see:

>>> new = pd.merge(dfA,dfB,on=["s_name","geo", "geoid", "date"],how="left")
>>> new.head()
  name    geo   zip  date  value_x  value_y
0    A  state    01  2009      NaN      1.0
1    B  state    01  2010      NaN      NaN
2    C  state    01  2011      NaN      NaN
3    D  state    01  2012      NaN      4.0
4    E  state    01  2013      NaN      5.0

I can't be sure value_y is always numbered and value_x is always NaN. But I want a merged value, call it value that is whichever-value-is-not-NaN. I try this:

>>> new["value"] = new.apply(lambda r: r.value_x or r.value_y, axis=1)
>>> new.head()
  name    geo   zip  date  value_x  value_y  value
0    A  state    01  2009      NaN      1.0    NaN
1    B  state    01  2010      NaN      NaN    NaN
2    C  state    01  2011      NaN      NaN    NaN
3    D  state    01  2012      NaN      4.0    NaN
4    E  state    01  2013      NaN      5.0    NaN

Oh no.

It makes sense in that NaN should propagate, but is not what I'm looking for. I'd like logic that would return whichever is present, not return NaN if either is present.

I'd like the logic that None gives me. You can see:

>>> new["value_z"] = None
>>> new.head()
  name    geo   zip  date  value_x  value_y  value value_z
0    A  state    01  2009      NaN      1.0    NaN    None
1    B  state    01  2010      NaN      NaN    NaN    None
2    C  state    01  2011      NaN      NaN    NaN    None
3    D  state    01  2012      NaN      4.0    NaN    None
4    E  state    01  2013      NaN      5.0    NaN    None

>>> new["value2"] = new.apply(lambda r: r.value_z or r.value_y, axis=1)
>>> new.head()
  name    geo   zip  date  value_x  value_y  value value_z   value2
0    A  state    01  2009      NaN      1.0    NaN    None      1.0
1    B  state    01  2010      NaN      NaN    NaN    None      NaN
2    C  state    01  2011      NaN      NaN    NaN    None      NaN
3    D  state    01  2012      NaN      4.0    NaN    None      4.0
4    E  state    01  2013      NaN      5.0    NaN    None      5.0

The logic that creates value2 is the behavior I'm looking for, not value.

What's the best way to do this?

Upvotes: 2

Views: 1268

Answers (3)

anky
anky

Reputation: 75080

if you have a preference for value_x , you could try:

df.value_x = df.value_x.fillna(df.value_y)
df.pop('value_y')

or :

df.value_x=df.value_x.fillna(df.pop('value_y'))

>>df
   name geo    zip  date    value_x
0   A   state   1   2009    1.0
1   B   state   1   2010    NaN
2   C   state   1   2011    NaN
3   D   state   1   2012    4.0
4   E   state   1   2013    5.0

Upvotes: 4

cs95
cs95

Reputation: 402483

combine_first will work after merge:

dfC = pd.merge(dfA, dfB, on=["s_name", "geo", "zip", "date"], how="left")
dfC['value'] = dfC.pop('value_x').combine_first(dfC.pop('value_y'))
dfC

  s_name  geo    zip  date  value
0      A  zip  60601  2010    1.0
1      B  zip  60601  2010    NaN
2      C  zip  60601  2010    NaN
3      D  zip  60601  2010    4.0

combine_first gives preference to "value_x" over "value_y". You can also write this as:

dfC = pd.merge(dfA, dfB, on=["s_name", "geo", "zip", "date"], how="left")
dfC['value_x'] = dfC['value_x'].combine_first(dfC.pop('value_y'))
dfC

  s_name  geo    zip  date  value_x
0      A  zip  60601  2010      1.0
1      B  zip  60601  2010      NaN
2      C  zip  60601  2010      NaN
3      D  zip  60601  2010      4.0

Upvotes: 3

Mittenchops
Mittenchops

Reputation: 19664

This technically works by hammering out the logic, but is ugly and feels like a hack (I believe it gives preference to value_x due to operator short-circuiting?):

>>> new["value3"] = new.apply(lambda r: (not(pd.isna(r.value_x)) or r.value_y) or (r.value_x or not(pd.isna(r.value_y))), axis=1)

>>> new.head()
  name    geo   zip  date  value_x  value_y  value value_z   value2 value3
0    A  state    01  2009      NaN      1.0    NaN    None      1.0    1.0
1    B  state    01  2010      NaN      NaN    NaN    None      NaN    NaN
2    C  state    01  2011      NaN      NaN    NaN    None      NaN    NaN
3    D  state    01  2012      NaN      4.0    NaN    None      4.0    4.0
4    E  state    01  2013      NaN      5.0    NaN    None      5.0    5.0

Upvotes: 0

Related Questions