Vero
Vero

Reputation: 459

Applying series of conditions to dataframe. Pandas

I used to work with np.where function when applying multiple conditions to dataframe and feel ok in using it. I would like to improve my code where the same condition is repeated in each set of conditions in np.where and I do not know how to do it in most simple (clear and concise manner), either using (1) .loc or (2) IF "condition" DO " apply other conditions"

Example:

I need to select only rows where "Date" is under condition (eg. >20200201) and only for these rows, calculate new column, applying another set of different conditions (eg. condition 1: A >20 and B >20; condition 2: A==30 and B==10, condition 3: A==20 and B>=10 etc)

My question what will be the best way to make the first selection (Data >20200202) to not repeat Date>2020201 in every line and avoid this:

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [1,3,2,2,3,1,3,2],
           "Date": [20200109, 20200204, 20200307, 20200216, 20200107, 20200108, 20200214, 20200314],
           "A": [20,10,40,40,10,20, 40,30], 
           "B": [20,30,40,50,20, 30, 20, 10]})

df['new']=np.nan
df['new']=np.where((df['Date']>20200201) & (df['A']>20) & (df['B']>20), 'value', df['new'])
df['new']=np.where((df['Date']>20200201) & (df['A']==30) & (df['B']==10), 'value', df['new'])
df['new']=np.where((df['Date']>20200201) & (df['A']==20) & (df['B']>=10), 'value', df['new'])

Upvotes: 1

Views: 133

Answers (2)

Valdi_Bo
Valdi_Bo

Reputation: 31011

It is probably not the fastest solution, but its advantage is readability and easy maintenance (in the future).

  1. Find rows in question using query and the indices of these rows:

    ind = df.query('Date > 20200201 and (A > 20 and B > 20 or '
        'A == 30 and B == 10 or A == 20 and B >= 10)').index
    
  2. Save new value in new column, in the indicated rows:

    df.loc[ind, 'new'] = 'value'; df
    

Other values in this column remain NaN.

If in the future something changes in the above condition, it is quite easy and intuitive to correct it.

So unless your data volume is very big and the execution time is prohibitively long, this solution is worth to consider.

Upvotes: 0

Quang Hoang
Quang Hoang

Reputation: 150785

Looks like you can use np.select:

s1 = df.Date <= 20200201
s2 = (df['A'] > 20) & df['B'].gt(20)
s3 = df['A'].eq(30) & df['B'].eq(10)
s4 = df['A'].eq(20) & df['B'].ge(10)

df['new'] = np.select( (s1,s2|s3|s4), (np.nan, 'value'), np.nan)

Output:

   ID      Date   A   B    new
0   1  20200109  20  20    nan
1   3  20200204  10  30    nan
2   2  20200307  40  40  value
3   2  20200216  40  50  value
4   3  20200107  10  20    nan
5   1  20200108  20  30    nan
6   3  20200214  40  20    nan
7   2  20200314  30  10  value

Upvotes: 2

Related Questions