Reputation: 77

Pandas conditional filter

I have a dataframe

   A     B     C
0  True  True  True
1  True  False False
2  False False False

I would like to add a column D with the following conditions:

D is true, if A, B and C are true. Else, D is false.

I tried

df['D'] = df.loc[(df['A'] == True) & df['B'] == True & df['C'] == True]

I get

TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]

Then I tried to follow this example and wrote a similar function as suggested in the link:

def all_true(row):

   if row['A'] == True:
      if row['B'] == True:
         if row['C'] == True:
             val = True
   else:
      val = 0

   return val

df['D'] = df.apply(all_true(df), axis=1)

In which case I get

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'd appreciate suggestions. Thanks!

Upvotes: 5

Answers (3)

jezrael

Reputation: 862681

Comparing with True is not necessary; just chain the boolean masks with &:

df['D'] = df['A'] & df['B'] & df['C']

If performance is important:

df['D'] = df['A'].values & df['B'].values & df['C'].values

Or use DataFrame.all to check whether all values in each row are True:

df['D'] = df[['A','B','C']].all(axis=1)

#numpy all 
#df['D'] = np.all(df.values,1)

print (df)
       A      B      C      D
0   True   True   True   True
1   True  False  False  False
2  False  False  False  False
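
On why the first attempt in the question fails: in Python, & binds more tightly than ==, so an unparenthesized `df['B'] == True & df['C'] == True` is not evaluated as three separate comparisons. In addition, df.loc[mask] returns the filtered rows rather than a boolean column. A minimal sketch of the parenthesized form, assuming the three boolean columns from the question (the sample frame is rebuilt here so the snippet runs on its own):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": [True, True, False],
    "B": [True, False, False],
    "C": [True, False, False],
})

# Parenthesize each comparison so that & combines three boolean masks.
mask = (df["A"] == True) & (df["B"] == True) & (df["C"] == True)

# Assign the mask itself; df.loc[mask] would return the matching rows instead.
df["D"] = mask
print(df["D"].tolist())  # [True, False, False]
```

With genuinely boolean columns this gives the same result as the shorter `df['A'] & df['B'] & df['C']`.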

Performance:

import numpy as np
import pandas as pd
import perfplot

np.random.seed(125)

def all1(df):
    df['D'] = df.all(axis=1)
    return df

def all1_numpy(df):
    df['D'] = np.all(df.values,1)
    return df

def eval1(df):
    df['D'] = df.eval('A & B & C')
    return df

def chained(df):
    df['D'] = df['A'] & df['B'] & df['C']
    return df

def chained_numpy(df):
    df['D'] = df['A'].values & df['B'].values & df['C'].values
    return df

def make_df(n):
    df = pd.DataFrame({'A':np.random.choice([True, False], size=n),
                       'B':np.random.choice([True, False], size=n),
                       'C':np.random.choice([True, False], size=n)})
    return df

perfplot.show(
    setup=make_df,
    kernels=[all1, all1_numpy, eval1,chained,chained_numpy],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
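
If perfplot is not available, a rough comparison can be sketched with the standard-library timeit module. The frame size, the number of runs, and the names `candidates` and `timings` are arbitrary choices for illustration and are not part of the benchmark above; results will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(125)
n = 100_000  # arbitrary size, chosen only for illustration
df = pd.DataFrame(rng.choice([True, False], size=(n, 3)), columns=list("ABC"))

# Each candidate computes the same row-wise "all True" result.
candidates = {
    "all(axis=1)": lambda: df[["A", "B", "C"]].all(axis=1),
    "chained &": lambda: df["A"] & df["B"] & df["C"],
    "chained & on numpy arrays": lambda: (
        df["A"].to_numpy() & df["B"].to_numpy() & df["C"].to_numpy()
    ),
}

# Take the best of 3 repeats, each consisting of 10 calls.
timings = {
    name: min(timeit.repeat(fn, number=10, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s per 10 calls")
```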

Upvotes: 5

Space Impact

Reputation: 13255

Using pandas eval:

df['D'] = df.eval('A & B & C')

Or:

df = df.eval('D = A & B & C')
# alternative, in place: df.eval('D = A & B & C', inplace=True)

Or:

df['D'] = np.all(df.values,1)

print(df)
       A      B      C      D
0   True   True   True   True
1   True  False  False  False
2  False  False  False  False

Upvotes: 2

U13-Forward

Reputation: 71580

Or even better:

df['D'] = df.all(axis=1)

Now

print(df)

gives:

       A      B      C      D
0   True   True   True   True
1   True  False  False  False
2  False  False  False  False
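
For completeness, the row-wise function from the question can also be made to work. The ValueError there comes from calling all_true(df) on the whole frame instead of passing the function itself to apply. A minimal sketch, assuming the same three boolean columns (the sample frame is rebuilt so the snippet is self-contained):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": [True, True, False],
    "B": [True, False, False],
    "C": [True, False, False],
})

def all_true(row):
    # row is one row of the frame: a Series indexed by column name.
    return bool(row["A"] and row["B"] and row["C"])

# Pass the function object; with axis=1, apply calls it once per row.
df["D"] = df.apply(all_true, axis=1)
print(df["D"].tolist())  # [True, False, False]
```

This calls a Python function once per row, so on large frames the vectorized forms above are usually the better choice.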

Upvotes: 5
