Reputation: 24535
I have the following dataframe with some missing values:
A B
0 63.0 9.0
1 NaN 35.0
2 51.0 95.0
3 25.0 11.0
4 91.0 NaN
5 2.0 47.0
6 37.0 10.0
7 NaN 88.0
8 75.0 87.0
9 92.0 21.0
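For reproducibility, the dataframe can be rebuilt from the values above (a minimal sketch):
import numpy
import pandas

df = pandas.DataFrame({
    "A": [63, numpy.nan, 51, 25, 91, 2, 37, numpy.nan, 75, 92],
    "B": [9, 35, 95, 11, numpy.nan, 47, 10, 88, 87, 21],
})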
I want to create a new column based on conditions on both of the above columns:
df['C'] = numpy.where((df['A']>55) | (df['B']>55), "Yes", "No")
This works but does not take into account missing values:
A B C
0 63.0 9.0 Yes
1 NaN 35.0 No
2 51.0 95.0 Yes
3 25.0 11.0 No
4 91.0 NaN Yes
5 2.0 47.0 No
6 37.0 10.0 No
7 NaN 88.0 Yes
8 75.0 87.0 Yes
9 92.0 21.0 Yes
To correct for the missing values, I have to run the following code:
df['C'] = numpy.where((df['A'].isnull()) | (df['B'].isnull()), numpy.nan, df['C'])
Then I get the proper new column:
A B C
0 63.0 9.0 Yes
1 NaN 35.0 NaN
2 51.0 95.0 Yes
3 25.0 11.0 No
4 91.0 NaN NaN
5 2.0 47.0 No
6 37.0 10.0 No
7 NaN 88.0 NaN
8 75.0 87.0 Yes
9 92.0 21.0 Yes
Is there a better way, so that I can correct for missing values in a single line of code?
Upvotes: 3
Views: 996
Reputation: 353009
IMO if we want to ignore the NaN rows, let's just drop them. Assignment will align on the index, resulting in NaN wherever a row's label is absent from the assigned values.
In [317]: df["C"] = (df[["A", "B"]].dropna() > 55).any(axis=1).replace(
              {False: "No", True: "Yes"})
In [318]: df
Out[318]:
A B C
0 63.0 9.0 Yes
1 NaN 35.0 NaN
2 51.0 95.0 Yes
3 25.0 11.0 No
4 91.0 NaN NaN
5 2.0 47.0 No
6 37.0 10.0 No
7 NaN 88.0 NaN
8 75.0 87.0 Yes
9 92.0 21.0 Yes
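The alignment can be seen in isolation with a toy series (the name s and column D are illustrative):
import pandas as pd

s = pd.Series(["Yes", "No"], index=[0, 2])
df["D"] = s   # only labels 0 and 2 receive values; every other row gets NaN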
Upvotes: 2
Reputation: 51165
Using np.select. The order of the conditions is important here, as np.select returns the choice for the first condition that is True, so you must put your null check first.
import numpy as np

c1 = df.isnull().any(axis=1)   # rows with at least one missing value
c2 = df.gt(55).any(axis=1)     # rows where at least one value exceeds 55
df['C'] = np.select([c1, c2], [np.nan, 'Yes'], 'No')
A B C
0 63.0 9.0 Yes
1 NaN 35.0 nan
2 51.0 95.0 Yes
3 25.0 11.0 No
4 91.0 NaN nan
5 2.0 47.0 No
6 37.0 10.0 No
7 NaN 88.0 nan
8 75.0 87.0 Yes
9 92.0 21.0 Yes
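To see why the order matters, take a row that satisfies both conditions at once, such as A=91 with B missing (a toy sketch, reusing the c1/c2 names from above):
import numpy as np
import pandas as pd

d = pd.DataFrame({"A": [91.0], "B": [np.nan]})
c1 = d.isnull().any(axis=1)   # True: the row has a NaN
c2 = d.gt(55).any(axis=1)     # also True: 91 > 55
np.select([c1, c2], [np.nan, 'Yes'], 'No')   # -> ['nan']: the null check wins
np.select([c2, c1], ['Yes', np.nan], 'No')   # -> ['Yes']: the NaN no longer propagates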
Upvotes: 4
Reputation: 402323
This becomes a little complicated depending on how many columns you want to do this for.
Here's an alternative with map and mask:
(df[['A', 'B']]
    .gt(55)
    .any(axis=1)
    .map(lambda x: 'Yes' if x else 'No')   # or: .map({True: 'Yes', False: 'No'}.__getitem__)
    .mask(df[['A', 'B']].isna().any(axis=1))
)
0 Yes
1 NaN
2 Yes
3 No
4 NaN
5 No
6 No
7 NaN
8 Yes
9 Yes
dtype: object
This is still a single statement, split across multiple lines for readability.
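If the check should cover more columns, the same chain generalizes by parameterizing the column list (a sketch; cols and sub are assumed names):
cols = ['A', 'B']   # extend with further columns as needed
sub = df[cols]
df['C'] = (sub.gt(55)
              .any(axis=1)
              .map({True: 'Yes', False: 'No'})
              .mask(sub.isna().any(axis=1)))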
Upvotes: 3
Reputation: 323226
Chain two any calls, then mask and map:
(df > 55).any(axis=1).mask(df.isnull().any(axis=1), np.nan).map({1: 'Yes', 0: 'No'})
Out[405]:
0 Yes
1 NaN
2 Yes
3 No
4 NaN
5 No
6 No
7 NaN
8 Yes
9 Yes
dtype: object
Or using two np.where calls:
a = (df > 55).any(axis=1)
b = df.isnull().any(axis=1)
np.where(b, np.nan, np.where(a, 'Yes', 'No'))
Out[407]:
array(['Yes', 'nan', 'Yes', 'No', 'nan', 'No', 'No', 'nan', 'Yes', 'Yes'],
dtype='<U32')
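Note the dtype above: because the inner np.where already produces strings, the outer np.where coerces np.nan to the literal string 'nan'. To keep real missing values instead, one option is to mask with pandas afterwards (a sketch reusing a and b):
import pandas as pd

c = pd.Series(np.where(a, 'Yes', 'No'), index=df.index)
df['C'] = c.mask(b)   # real NaN wherever a row has a missing input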
Upvotes: 3