user3786999

Reputation: 1127

Using conditional to generate new column in pandas dataframe

I have a pandas dataframe that looks like this:

   portion  used
0        1   1.0
1        2   0.3
2        3   0.0
3        4   0.8

I'd like to create a new column based on the used column, so that the df looks like this:

   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial

What's the best way to do that?

Upvotes: 30

Views: 106421

Answers (6)

tdy

Reputation: 41327

Use np.select() for >2 conditions

Given >2 conditions like OP's example, np.select() is much cleaner than nesting multiple levels of np.where() (and is just as fast).

  • Either define the conditions/choices as two lists (paired element-wise) with an optional default value ("else" case):

    conditions = [
        df.used.eq(0),
        df.used.eq(1),
    ]
    choices = [
        'Empty',
        'Full',
    ]
    df['alert'] = np.select(conditions, choices, default='Partial')
    
  • Or define the conditions/choices as a dictionary for maintainability (easier to keep them paired properly when making additions/revisions):

    conditions = {
        'Empty': df.used.eq(0),
        'Full': df.used.eq(1),
    }
    df['alert'] = np.select(list(conditions.values()), list(conditions.keys()), default='Partial')
    

np.select() is very fast

Timings with 5 conditions (full, high, medium, low, empty):

[Figure: timings with 5 conditions]

df = pd.DataFrame({'used': np.random.randint(10 + 1, size=10)}).div(10)
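
The five-condition case is only named above, not shown. A minimal sketch of what it could look like follows; the 0.7 and 0.4 cut-offs and the sample values are assumptions made for illustration, and only the five labels come from the answer. Because np.select() takes the first condition that matches, the exact-match cases are listed before the range checks.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'used': [1.0, 0.8, 0.5, 0.2, 0.0]})

# Hypothetical thresholds (0.7 and 0.4); order matters because
# np.select() uses the first condition that is True for each row.
conditions = {
    'Full': df['used'].eq(1),
    'Empty': df['used'].eq(0),
    'High': df['used'].ge(0.7),
    'Medium': df['used'].ge(0.4),
}

# Rows matching none of the conditions fall through to the default.
df['alert'] = np.select(
    list(conditions.values()),
    list(conditions.keys()),
    default='Low',
)
```

Adding a sixth label then only requires one more dictionary entry, which is the maintainability point made above.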

Upvotes: 6

Ffisegydd

Reputation: 53688

You can define a function that returns the different states ("Full", "Partial", "Empty", etc.) and then use df.apply to apply it to each row. Note that you must pass the keyword argument axis=1 so that the function is applied to rows rather than columns.

import pandas as pd

def alert(row):
    if row['used'] == 1.0:
        return 'Full'
    elif row['used'] == 0.0:
        return 'Empty'
    elif 0.0 < row['used'] < 1.0:
        return 'Partial'
    else:
        return 'Undefined'

df = pd.DataFrame(data={'portion':[1, 2, 3, 4], 'used':[1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

#    portion  used    alert
# 0        1   1.0     Full
# 1        2   0.3  Partial
# 2        3   0.0    Empty
# 3        4   0.8  Partial

Upvotes: 50

df['TaxStatus'] = np.where(df.Public == 1, True, np.where(df.Public == 2, False))

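This looks like it should work, but it raises ValueError: either both or neither of x and y should be given. The inner np.where() is given a condition and a value for the True case but no value for the False case.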
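
One way to make that expression run is to give the inner np.where() its third argument. The sketch below assumes that rows matching neither code should get None; that fallback and the sample Public values are illustrative choices, not part of the original snippet.

```python
import numpy as np
import pandas as pd

# Made-up sample data: 1 and 2 are the codes from the snippet, 3 is "anything else".
df = pd.DataFrame({'Public': [1, 2, 3]})

# The inner call now has both a True value and a False value (None is an assumption).
tax_status = np.where(
    df['Public'] == 1, True,
    np.where(df['Public'] == 2, False, None),
)
df['TaxStatus'] = tax_status
```

Mixing booleans with None produces an object-dtype array, so a different fallback may be preferable if the column needs a proper boolean dtype.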

Upvotes: 1

Hansang

Reputation: 1622

Building on Ffisegydd's approach, you can use a dictionary and the dict.get() method to make the function passed to .apply() easier to manage:

import pandas as pd

def alert(c):
    mapping = {1.0: 'Full', 0.0: 'Empty'}
    return mapping.get(c['used'], 'Partial')

df = pd.DataFrame(data={'portion':[1, 2, 3, 4], 'used':[1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

Depending on the use case, you might like to define the dict outside of the function definition as well.
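
As a minimal sketch of that last suggestion, the mapping can live at module level so it is built once rather than on every call. The name ALERT_LABELS is a hypothetical choice for illustration.

```python
import pandas as pd

# Hypothetical module-level constant: created once and shared by every call.
ALERT_LABELS = {1.0: 'Full', 0.0: 'Empty'}

def alert(row):
    # Any value that is not exactly 1.0 or 0.0 falls back to 'Partial'.
    return ALERT_LABELS.get(row['used'], 'Partial')

df = pd.DataFrame(data={'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)
```

This also lets other code reuse or extend the same mapping without touching the function.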

Upvotes: 1

Primer

Reputation: 10302

Alternatively you could do:

import pandas as pd
import numpy as np
df = pd.DataFrame(data={'portion':np.arange(10000), 'used':np.random.rand(10000)})

%%timeit
df.loc[df['used'] == 1.0, 'alert'] = 'Full'
df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
df.loc[(df['used'] >0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'

This gives the same output but runs about 100 times faster on 10,000 rows:

100 loops, best of 3: 2.91 ms per loop

Compare that with apply:

%timeit df['alert'] = df.apply(alert, axis=1)

1 loops, best of 3: 287 ms per loop

I guess the choice depends on how big your dataframe is.
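
The "same output" claim can be checked outside IPython with a small sketch like the one below. It assumes the alert function from Ffisegydd's answer (copied here so the block is self-contained); the column names alert_mask and alert_apply, the seed, and the 1,000-row size are illustrative choices. Two values are pinned to 1.0 and 0.0 so that every branch is exercised.

```python
import numpy as np
import pandas as pd

def alert(row):
    # Same logic as the row-wise function used with apply above.
    if row['used'] == 1.0:
        return 'Full'
    elif row['used'] == 0.0:
        return 'Empty'
    elif 0.0 < row['used'] < 1.0:
        return 'Partial'
    else:
        return 'Undefined'

rng = np.random.default_rng(0)
used = rng.random(1_000)
used[0], used[1] = 1.0, 0.0  # force one 'Full' row and one 'Empty' row
df = pd.DataFrame({'portion': np.arange(1_000), 'used': used})

# Boolean-mask assignment: each line labels one subset of rows.
df.loc[df['used'] == 1.0, 'alert_mask'] = 'Full'
df.loc[df['used'] == 0.0, 'alert_mask'] = 'Empty'
df.loc[(df['used'] > 0.0) & (df['used'] < 1.0), 'alert_mask'] = 'Partial'

# Row-wise apply, for comparison.
df['alert_apply'] = df.apply(alert, axis=1)

same = bool((df['alert_mask'] == df['alert_apply']).all())
```

Note that the two approaches differ for values outside [0, 1]: the mask version leaves those rows as NaN, while the function returns 'Undefined'.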

Upvotes: 45

Zero

Reputation: 76927

Use np.where, which is usually fast

In [845]: df['alert'] = np.where(df.used == 1, 'Full', 
                                 np.where(df.used == 0, 'Empty', 'Partial'))

In [846]: df
Out[846]:
   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial

Timings

In [848]: df.shape
Out[848]: (100000, 3)

In [849]: %timeit df['alert'] = np.where(df.used == 1, 'Full', np.where(df.used == 0, 'Empty', 'Partial'))
100 loops, best of 3: 6.17 ms per loop

In [850]: %%timeit
     ...: df.loc[df['used'] == 1.0, 'alert'] = 'Full'
     ...: df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
     ...: df.loc[(df['used'] >0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'
     ...:
10 loops, best of 3: 21.9 ms per loop

In [851]: %timeit df['alert'] = df.apply(alert, axis=1)
1 loop, best of 3: 2.79 s per loop

Upvotes: 21
