rnso

Reputation: 24535

New column based on multiple conditions ignoring missing values

I have the following dataframe with some missing values:

       A     B
0  63.0   9.0
1   NaN  35.0
2  51.0  95.0
3  25.0  11.0
4  91.0   NaN
5   2.0  47.0
6  37.0  10.0
7   NaN  88.0
8  75.0  87.0
9  92.0  21.0

I want to create a new column based on conditions on both of the above columns:

df['C'] = numpy.where((df['A']>55) | (df['B']>55), "Yes", "No")

This works but does not take into account missing values:

      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0   No
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  Yes
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  Yes
8  75.0  87.0  Yes
9  92.0  21.0  Yes

To correct for missing values, I have to run the following code:

df['C'] = numpy.where((df['A'].isnull()) | (df['B'].isnull()), numpy.nan, df['C'])

Then I get the proper new column:

      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0  NaN
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  NaN
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  NaN
8  75.0  87.0  Yes
9  92.0  21.0  Yes

Is there any better way to correct for missing values in a single line of code?

Upvotes: 3

Views: 996

Answers (4)

DSM

Reputation: 353009

IMO if we want to ignore the NaN rows, let's just drop them. Assignment will align on the indices, resulting in NaNs where the index is missing.

In [317]: df["C"] = (df[["A","B"]].dropna() > 55).any(axis=1).replace(
                    {False: "No", True: "Yes"})

In [318]: df
Out[318]: 
      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0  NaN
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  NaN
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  NaN
8  75.0  87.0  Yes
9  92.0  21.0  Yes

Upvotes: 2

user3483203

Reputation: 51165

Using np.select. The order of the conditions is important here: np.select returns the choice for the first condition that is satisfied, so the null check must come first.

import numpy as np

c1 = df.isnull().any(axis=1)
c2 = df.gt(55).any(axis=1)

df['C'] = np.select([c1, c2], [np.nan, 'Yes'], 'No')

      A     B    C
0  63.0   9.0  Yes
1   NaN  35.0  nan
2  51.0  95.0  Yes
3  25.0  11.0   No
4  91.0   NaN  nan
5   2.0  47.0   No
6  37.0  10.0   No
7   NaN  88.0  nan
8  75.0  87.0  Yes
9  92.0  21.0  Yes
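Note that in the output above the missing rows show the lowercase string 'nan', not a true NaN: np.select builds a single array from the choices, and mixing np.nan with strings coerces everything to a string dtype. A minimal sketch of one way to restore real missing values afterwards by masking the null rows again (the tiny example frame here is hypothetical, mirroring the question's data):

```python
import numpy as np
import pandas as pd

# hypothetical frame mirroring the question's data
df = pd.DataFrame({'A': [63.0, np.nan, 25.0],
                   'B': [9.0, 35.0, 11.0]})

c1 = df.isnull().any(axis=1)   # rows with any missing value
c2 = df.gt(55).any(axis=1)     # rows with any value above 55

# np.select coerces the mixed choices to strings, so NaN becomes 'nan'
df['C'] = np.select([c1, c2], [np.nan, 'Yes'], 'No')

# mask the null rows again to turn the 'nan' strings into real NaN
df['C'] = df['C'].mask(c1)
```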

Upvotes: 4

cs95

Reputation: 402323

This becomes a little complicated depending on how many columns you want to do this for.

Here's an alternative with map and mask:

(df[['A', 'B']]
   .gt(55)
   .any(axis=1)
   .map(lambda x: 'Yes' if x else 'No')  # or: .map({True: 'Yes', False: 'No'})
   .mask(df[['A', 'B']].isna().any(axis=1))
)

0    Yes
1    NaN
2    Yes
3     No
4    NaN
5     No
6     No
7    NaN
8    Yes
9    Yes
dtype: object

This is still a single statement, split across multiple lines for readability.
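For completeness, a self-contained sketch of the same map/mask chain with the result written back into the new column (the tiny example frame is hypothetical, mirroring the question's data):

```python
import numpy as np
import pandas as pd

# hypothetical frame mirroring the question's data
df = pd.DataFrame({'A': [63.0, np.nan, 25.0],
                   'B': [9.0, 35.0, 11.0]})

# threshold check, mapped to labels, then masked where a value is missing
df['C'] = (df[['A', 'B']]
           .gt(55)
           .any(axis=1)
           .map({True: 'Yes', False: 'No'})
           .mask(df[['A', 'B']].isna().any(axis=1)))
```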

Upvotes: 3

BENY

Reputation: 323226

Chain two `any` calls with `mask` and `map`:

(df > 55).any(axis=1).mask(df.isnull().any(axis=1), np.nan).map({1: 'Yes', 0: 'No'})
Out[405]: 
0    Yes
1    NaN
2    Yes
3     No
4    NaN
5     No
6     No
7    NaN
8    Yes
9    Yes
dtype: object

Or using two np.where

a = (df > 55).any(axis=1)
b = df.isnull().any(axis=1)
np.where(b, np.nan, np.where(a, 'Yes', 'No'))
Out[407]: 
array(['Yes', 'nan', 'Yes', 'No', 'nan', 'No', 'No', 'nan', 'Yes', 'Yes'],
      dtype='<U32')

Upvotes: 3
