Reputation: 1222

Optimized way for multiple conditions on pandas dataframe columns value

I am applying multiple filters on a dataframe at the same time.

data_df[(data_df['1']!=0) & (data_df['2']==0) & (data_df['3']==0) & (data_df['4']==0) & (data_df['5']==0)]

I needed to know is there any optimized way to do this? As I want to compare one column's value as !=0 and others value as =0 multiple times and columns could be more than 5. So, all operations will be:

data_df[(data_df['1']==0) & (data_df['2']!=0) & (data_df['3']==0) & (data_df['4']==0) & (data_df['5']==0)]
data_df[(data_df['1']==0) & (data_df['2']==0) & (data_df['3']!=0) & (data_df['4']==0) & (data_df['5']==0)]
data_df[(data_df['1']==0) & (data_df['2']==0) & (data_df['3']==0) & (data_df['4']!=0) & (data_df['5']==0)]
data_df[(data_df['1']==0) & (data_df['2']==0) & (data_df['3']==0) & (data_df['4']==0) & (data_df['5']!=0)]

Looking for a short and optimized method.

Upvotes: 1

Answers (3)

jezrael

Reputation: 863301

One idea is compare by numpy array filled by 1 and 0 and test if all values matching by numpy.all:

#test list - all 0, first 1
L = [1,0,0,0,0]

df = data_df[np.all(data_df == np.array(L), axis=1)]

Or use DataFrame.merge by one row DataFrame:

df = data_df.merge(pd.DataFrame([L], columns=data_df.columns))

Sample:

np.random.seed(2020)
data_df = pd.DataFrame(np.random.randint(2, size=(100, 5)), columns=list('12345'))
#print (data_df)

df = data_df[np.all(data_df == np.array(L), axis=1)]
print (df)
    1  2  3  4  5
2   1  0  0  0  0
13  1  0  0  0  0
44  1  0  0  0  0
58  1  0  0  0  0
70  1  0  0  0  0
89  1  0  0  0  0

Or:

L = [1,0,0,0,0]
df = data_df.merge(pd.DataFrame([L], columns=data_df.columns))
print (df)
   1  2  3  4  5
0  1  0  0  0  0
1  1  0  0  0  0
2  1  0  0  0  0
3  1  0  0  0  0
4  1  0  0  0  0
5  1  0  0  0  0

Solution with merge should be used with helper DataFrame with all combinations:

df1 = pd.DataFrame(0, index=data_df.columns, columns=data_df.columns)
np.fill_diagonal(df1.to_numpy(), 1)
print (df1)
   1  2  3  4  5
1  1  0  0  0  0
2  0  1  0  0  0
3  0  0  1  0  0
4  0  0  0  1  0
5  0  0  0  0  1

df = data_df.merge(df1.loc[['1']])
print (df)
   1  2  3  4  5
0  1  0  0  0  0
1  1  0  0  0  0
2  1  0  0  0  0
3  1  0  0  0  0
4  1  0  0  0  0
5  1  0  0  0  0

df = data_df.merge(df1.loc[['2']])
print (df)
   1  2  3  4  5
0  0  1  0  0  0
1  0  1  0  0  0

Upvotes: 2

anky

Reputation: 75120

Based on the below statements:

Looking for a short and optimized method

and

I want to compare one column's value as !=0 and others value as =0

You can use df.ne and df.eq with df.drop on axis=1 to drop the column 1:

data_df[data_df[1].ne(0) & data_df.drop(1,axis=1).eq(0).all(axis=1)]

Upvotes: 2

willeM_ Van Onsem

Reputation: 477318

We can first calculate the dataframe, such that for the given columns we only calculate once if the value is equal to zero or not.

df_bool = df[['1', '2', '3', '4', '5']] == 0

Next we can use this as a mask:

df[~df_bool[1] & df[[2, 3, 4, 5]].all(axis=1)]

Upvotes: 2

Optimized way for multiple conditions on pandas dataframe columns value

Answers (3)

Related Questions