saga
saga

Reputation: 755

Pandas conditional creation of a dataframe column: based on multiple conditions

I have a df:

  col1 col2 col3
0    1    2    3
1    2    3    1
2    3    3    3
3    4    3    2

I want to add a new column based on the following conditions:

 - if   col1 > col2 > col3   ----->  2
 - elif col1 > col2          ----->  1
 - elif col1 < col2 < col3   -----> -2
 - elif col1 < col2          -----> -1
 - else                      ----->  0

And it should become this:

  col1 col2 col3   new
0    1    2    3   -2
1    2    3    1   -1
2    3    3    3    0
3    4    3    2    2

I followed the method from this post by unutbu, with 1 greater than or less than is fine. But in my case with more than 1 greater than or less than, conditions returns error:

conditions = [
       (df['col1'] > df['col2'] > df['col3']), 
       (df['col1'] > df['col2']),
       (df['col1'] < df['col2'] < df['col3']),
       (df['col1'] < df['col2'])]
choices = [2,1,-2,-1]
df['new'] = np.select(conditions, choices, default=0)


Traceback (most recent call last):

  File "<ipython-input-43-768a4c0ecf9f>", line 2, in <module>
    (df['col1'] > df['col2'] > df['col3']),

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1478, in __nonzero__
    .format(self.__class__.__name__))

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How should I do this?

Upvotes: 4

Views: 234

Answers (2)

sammywemmy
sammywemmy

Reputation: 28644

One option is with case_when from pyjanitor; under the hood it uses pd.Series.mask.

The basic idea is a pairing of condition and expected value; you can pass as many pairings as required, followed by a default value and a target column name:

# pip install pyjanitor
import pandas as pd
import janitor

df.case_when( 
      # condition, value
     'col1>col2>col3', 2,
     'col1>col2', 1,
     'col1<col2<col3', -2,
     'col1<col2', -1,
     0, # default
     column_name = 'new')

   col1  col2  col3  new
0     1     2     3   -2
1     2     3     1   -1
2     3     3     3    0
3     4     3     2    2

The code above uses strings for the conditions, which are evaluated by pd.eval on the parent dataframe - note that speed wise, this can be slower for small datasets. A faster option (depending on the data size) would be to avoid the pd.eval option:

df.case_when( 
      # condition, value
     df.col1.gt(df.col2) & df.col2.gt(df.col3), 2,
     df.col1.gt(df.col2), 1,
     df.col1.lt(df.col2) & df.col2.lt(df.col3), -2,
     df.col1.lt(df.col2), -1,
     0, # default
     column_name = 'new')

   col1  col2  col3  new
0     1     2     3   -2
1     2     3     1   -1
2     3     3     3    0
3     4     3     2    2

Upvotes: 0

BENY
BENY

Reputation: 323266

Change your code to

conditions = [
       (df['col1'] > df['col2']) &  (df['col2'] > df['col3']), 
       (df['col1'] > df['col2']),
       (df['col1'] < df['col2']) & (df['col2'] < df['col3']),
       (df['col1'] < df['col2'])]
choices = [2,1,-2,-1]
df['new'] = np.select(conditions, choices, default=0)

Upvotes: 3

Related Questions