Reputation: 52235
np.where has the semantics of a vectorized if/else (similar to Apache Spark's when/otherwise DataFrame method). I know that I can use np.where on pandas.Series, but pandas often defines its own API to use instead of raw numpy functions, which is usually more convenient with pd.Series/pd.DataFrame.
Sure enough, I found pandas.DataFrame.where. However, at first glance, it has completely different semantics. I could not find a way to rewrite the most basic example of np.where using pandas where:
# df is pd.DataFrame
# how to write this using df.where?
df['C'] = np.where((df['A']<0) | (df['B']>0), df['A']+df['B'], df['A']/df['B'])
Am I missing something obvious? Or is pandas' where intended for a completely different use case, despite having the same name as np.where?
Upvotes: 90
Views: 172948
Reputation: 402363
Series.case_when
From pandas 2.2.0, the API provides a pandas-native alternative to np.where and np.select.
Using case_when:
cond = (df['A'] < 0) | (df['B'] > 0)
df['C'] = (df['A'] / df['B']).case_when([(cond, df['A'] + df['B'])])
# or
df['C'] = 0  # Or pd.NA or any reasonable default.
df['C'] = df['C'].case_when([
    (cond, df['A'] + df['B']),
    (~cond, df['A'] / df['B']),
])
Note that case_when accepts an arbitrary list of (condition, replacement) pairs, so it generalizes easily to several conditions (much like np.select).
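To make the multi-condition case concrete, here is a minimal sketch comparing np.select with case_when. The sample data and the second rule (multiplying the columns when B > 0) are made up for illustration and are not from the question. The case_when part only runs on pandas 2.2.0 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the question).
df = pd.DataFrame({"A": [-2.0, 3.0, 4.0, 5.0], "B": [1.0, -3.0, 2.0, -4.0]})

neg_a = df["A"] < 0
pos_b = df["B"] > 0
total = df["A"] + df["B"]
product = df["A"] * df["B"]   # illustrative second rule
ratio = df["A"] / df["B"]     # fallback when no rule matches

# np.select: conditions are checked in order, the first match wins.
# The last condition is an explicit catch-all for the fallback.
conditions = [neg_a, pos_b, ~(neg_a | pos_b)]
choices = [total, product, ratio]
via_select = pd.Series(np.select(conditions, choices), index=df.index)
print(via_select.tolist())

# Series.case_when: start from the fallback and replace where a rule matches.
if hasattr(pd.Series, "case_when"):  # available from pandas 2.2.0
    via_case_when = ratio.case_when([(neg_a, total), (pos_b, product)])
    print(via_case_when.equals(via_select))
```

In the first row both rules match, and both approaches pick the first one (the sum).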
Using np.where:
df['C'] = np.where((df['A'] < 0) | (df['B'] > 0), df['A'] + df['B'], df['A'] / df['B'])
Upvotes: 5
Reputation: 2300
I prefer pandas' mask over where, since it is less counterintuitive (at least for me).
(df['A']/df['B']).mask((df['A']<0) | (df['B']>0), df['A']+df['B'])
Here, columns A and B are added where the condition holds; otherwise their ratio is left untouched.
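As a quick check of how mask relates to where, the sketch below (with made-up sample data) shows that mask with a condition gives the same result as where with the inverted condition.

```python
import pandas as pd

# Hypothetical sample data (not from the question).
df = pd.DataFrame({"A": [-2.0, 3.0, 4.0], "B": [1.0, -3.0, 2.0]})

cond = (df["A"] < 0) | (df["B"] > 0)
total = df["A"] + df["B"]
ratio = df["A"] / df["B"]

# mask replaces values where cond is True ...
via_mask = ratio.mask(cond, total)
# ... while where replaces values where cond is False, so invert the condition.
via_where = ratio.where(~cond, total)

print(via_mask.equals(via_where))
```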
Upvotes: 4
Reputation: 19104
Try:
(df['A'] + df['B']).where((df['A'] < 0) | (df['B'] > 0), df['A'] / df['B'])
The difference between numpy's where and DataFrame.where is that the default values are supplied by the DataFrame that the where method is called on (docs).
I.e.
np.where(m, A, B)
is roughly equivalent to
A.where(m, B)
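A small runnable sketch of that equivalence, using made-up sample data. The main practical difference is that np.where returns a plain ndarray, while the where method returns a Series that keeps the index.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; B has no zeros, so the division is well defined.
df = pd.DataFrame({"A": [-2.0, 3.0, 4.0, -1.0], "B": [1.0, -3.0, 2.0, -4.0]})

m = (df["A"] < 0) | (df["B"] > 0)
total = df["A"] + df["B"]
ratio = df["A"] / df["B"]

via_numpy = np.where(m, total, ratio)   # ndarray, index is dropped
via_pandas = total.where(m, ratio)      # Series, index is preserved

print(type(via_numpy).__name__, type(via_pandas).__name__)
print(np.array_equal(via_numpy, via_pandas.to_numpy()))
```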
If you wanted a similar call signature using pandas, you could take advantage of the way method calls work in Python:
pd.DataFrame.where(cond=(df['A'] < 0) | (df['B'] > 0), self=df['A'] + df['B'], other=df['A'] / df['B'])
or without kwargs (note that the positional order of arguments differs from the argument order of numpy's where):
pd.DataFrame.where(df['A'] + df['B'], (df['A'] < 0) | (df['B'] > 0), df['A'] / df['B'])
Upvotes: 79