Naveen Reddy Marthala
Naveen Reddy Marthala

Reputation: 3123

Pandas - fill a column with value from another column only when MULTIPLE COLUMNs are null

I have a Pandas DataFrame like this:

   a      b    c      x    a1     b1     c1     x1
0  aa    ba   ca      9    NaN    NaN    NaN    1
1  ab    bb   cb      9    NaN    NaN    NaN    NaN
2  ac    bc   cd    NaN    NaN    NaN    NaN    NaN
3  ad    bd   cd      9    1      NaN    NaN    NaN
4  ae    be   ce      9    NaN    2      NaN    12
5  af    bf   cf      9    NaN    NaN    3      14
6  ag    bg   cg      9    3      NaN    1      45

I want to do:

   a      b    c      x    a1     b1     c1     x1
0  aa    ba   ca      9    NaN    NaN    NaN    1
1  ab    bb   cb      9    NaN    NaN    NaN    9
2  ac    bc   cc    NaN    NaN    NaN    NaN    NaN
3  ad    bd   cd      9    1      NaN    NaN    NaN
4  ae    be   ce      9    NaN    2      NaN    12
5  af    bf   cf      9    NaN    NaN    3      14
6  ag    bg   cg      9    3      NaN    1      45

Explanation:

row 0: I don't want the replacement to happen, when there's already a value in x1, and even when all of the a1, b1 and c1 are NaNs.
rows 1 & 2: I want to fill column x1 with whatever is in x, only and only when columns a1, b1 and c1 are NaNs.
rows 3, 4, 5 & 6: I want the column x1 to stay as is, when any one or two columns from a1, b1 and c1 are NaNs.

Is there an efficient way to do this operation?


I have tried as shown in this and this stackoverflow answers, but I get broadcast error, presumably because I am trying to do the check on multiple columns.

Here's what I have tried:

np.where(np.isnan(df[['a1', 'b1', 'c1']].values), df['x'].values, df['x1'].values)

and my error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-bb6a0f9faf18> in <module>
----> 1 np.where(np.isnan(df[['a', 'b', 'c']].values), df['x'].values, df['x1'].values)

<__array_function__ internals> in where(*args, **kwargs)

ValueError: operands could not be broadcast together with shapes (176213,6) (176213,) (176213,) 

Upvotes: 1

Views: 820

Answers (1)

Henry Ecker
Henry Ecker

Reputation: 35626

One option is to use any on axis=1:

df['x1'] = np.where(df[['a1', 'b1', 'c1', 'x1']].any(axis=1), df['x1'], df['x'])

df:

    a   b   c    x   a1   b1   c1    x1
0  aa  ba  ca  9.0  NaN  NaN  NaN   1.0
1  ab  bb  cb  9.0  NaN  NaN  NaN   9.0
2  ac  bc  cd  NaN  NaN  NaN  NaN   NaN
3  ad  bd  cd  9.0  1.0  NaN  NaN   NaN
4  ae  be  ce  9.0  NaN  2.0  NaN  12.0
5  af  bf  cf  9.0  NaN  NaN  3.0  14.0
6  ag  bg  cg  9.0  3.0  NaN  1.0  45.0

Another option via @Jon Clements♦:

df.loc[~df[['a1', 'b1', 'c1', 'x1']].any(1), 'x1'] = df['x']

The reason for negating any instead of using all is that any will consider NaN as falsey, while all considers NaN as truthy:

pd.Series([np.nan, np.nan]).any()  # False
pd.Series([np.nan, np.nan]).all()  # True

Upvotes: 2

Related Questions