Reputation: 459
I used to work with np.where function when applying multiple conditions to dataframe and feel ok in using it. I would like to improve my code where the same condition is repeated in each set of conditions in np.where and I do not know how to do it in most simple (clear and concise manner), either using (1) .loc or (2) IF "condition" DO " apply other conditions"
Example:
I need to select only rows where "Date" is under condition (eg. >20200201) and only for these rows, calculate new column, applying another set of different conditions (eg. condition 1: A >20 and B >20; condition 2: A==30 and B==10, condition 3: A==20 and B>=10 etc)
My question what will be the best way to make the first selection (Data >20200202) to not repeat Date>2020201 in every line and avoid this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [1,3,2,2,3,1,3,2],
"Date": [20200109, 20200204, 20200307, 20200216, 20200107, 20200108, 20200214, 20200314],
"A": [20,10,40,40,10,20, 40,30],
"B": [20,30,40,50,20, 30, 20, 10]})
df['new']=np.nan
df['new']=np.where((df['Date']>20200201) & (df['A']>20) & (df['B']>20), 'value', df['new'])
df['new']=np.where((df['Date']>20200201) & (df['A']==30) & (df['B']==10), 'value', df['new'])
df['new']=np.where((df['Date']>20200201) & (df['A']==20) & (df['B']>=10), 'value', df['new'])
Upvotes: 1
Views: 133
Reputation: 31011
It is probably not the fastest solution, but its advantage is readability and easy maintenance (in the future).
Find rows in question using query and the indices of these rows:
ind = df.query('Date > 20200201 and (A > 20 and B > 20 or '
'A == 30 and B == 10 or A == 20 and B >= 10)').index
Save new value in new column, in the indicated rows:
df.loc[ind, 'new'] = 'value'; df
Other values in this column remain NaN.
If in the future something changes in the above condition, it is quite easy and intuitive to correct it.
So unless your data volume is very big and the execution time is prohibitively long, this solution is worth to consider.
Upvotes: 0
Reputation: 150785
Looks like you can use np.select
:
s1 = df.Date <= 20200201
s2 = (df['A'] > 20) & df['B'].gt(20)
s3 = df['A'].eq(30) & df['B'].eq(10)
s4 = df['A'].eq(20) & df['B'].ge(10)
df['new'] = np.select( (s1,s2|s3|s4), (np.nan, 'value'), np.nan)
Output:
ID Date A B new
0 1 20200109 20 20 nan
1 3 20200204 10 30 nan
2 2 20200307 40 40 value
3 2 20200216 40 50 value
4 3 20200107 10 20 nan
5 1 20200108 20 30 nan
6 3 20200214 40 20 nan
7 2 20200314 30 10 value
Upvotes: 2