msh855

Reputation: 1571

case_when function from R to Python

How can I implement R's case_when function in Python?

Here is the case_when function of R:

https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/case_when

As a minimal working example, suppose we have the following dataframe (Python code follows):

import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df

Suppose then that we want to create a new column called 'elderly' that looks at the 'age' column and does the following:

if age < 10 then baby
if age >= 10 and age < 20 then kid
if age >= 20 and age < 30 then young
if age >= 30 and age < 50 then mature
if age >= 50 then grandpa

Can someone help with this?

Upvotes: 24

Views: 13143

Answers (7)

Robert Winkelman

Reputation: 1

Also, for future reference: the Python siuba package implements case_when in a similar fashion to dplyr's case_when, for the R tidyverse users out there (like me).

(df
>> mutate(elderly = case_when({_.age<10 : "baby",
                          _.age<20 : "kid",
                          _.age<30 : "young",
                          _.age<50 : "mature",
                          _.age >= 50 : "grandpa"})))

Upvotes: 0

Zweifler

Reputation: 451

pandas 2.2 added a Series.case_when method.

caselist=[(df.age.lt(10), 'baby'), 
          (df.age.ge(10) & df.age.lt(20), 'kid'), 
          (df.age.ge(20) & df.age.lt(30), 'young'), 
          (df.age.ge(30) & df.age.lt(50), 'mature'), 
          (df.age.ge(50), 'grandpa')]

df.assign(elderly = df.age.case_when(caselist=caselist))
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   mature
1  Molly   52            24             94  grandpa
2   Tina   36            31             57   mature
3   Jake   24             2             62    young
4    Amy   73             3             70  grandpa

Upvotes: 6

Just for future reference: nowadays you can use pandas cut or map with moderate to good speed. It might not suit you if you need something faster, but it is good enough for daily use and batches.

import pandas as pd

If you choose map or apply, define your ranges in a function and return a label when the age falls in range:

def calc_grade(age):
    # Boundaries follow the question: 50 and above is 'Grandpa', 30 to 49 is 'Mature'.
    if 50 <= age < 200:
        return 'Grandpa'
    elif 30 <= age < 50:
        return 'Mature'
    elif 20 <= age < 30:
        return 'Young'
    elif 10 <= age < 20:
        return 'Kid'
    elif age < 10:
        return 'Baby'

%timeit df['elderly'] = df['age'].map(calc_grade)
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   Mature
1  Molly   52            24             94  Grandpa
2   Tina   36            31             57   Mature
3   Jake   24             2             62    Young
4    Amy   73             3             70  Grandpa

393 µs ± 8.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


If you choose cut, there are many options. One approach: include the left edge of each bin and exclude the right edge, with one label per bin.

bins = [0, 10, 20, 30, 50, 200]  # 200-year-old vampires are people too, I guess; change the upper bound to whatever age you find plausible.
labels = ['Baby','Kid','Young', 'Mature','Grandpa']

%timeit df['elderly'] = pd.cut(x=df.age, bins=bins, labels=labels , include_lowest=True, right=False, ordered=False)
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   Mature
1  Molly   52            24             94  Grandpa
2   Tina   36            31             57   Mature
3   Jake   24             2             62    Young
4    Amy   73             3             70  Grandpa

Upvotes: 1

Fernando Rocha Urbano

Reputation: 43

Instead of numpy, you can write a function and use map or apply with a lambda:

def elderly_function(age):
    # Checks run top to bottom, so each branch only needs its upper bound.
    if age < 10:
        return 'baby'
    if age < 20:
        return 'kid'
    if age < 30:
        return 'young'
    if age < 50:
        return 'mature'
    if age >= 50:
        return 'grandpa'

df["elderly"] = df["age"].map(lambda x: elderly_function(x))
# Works with apply as well:
df["elderly"] = df["age"].apply(lambda x: elderly_function(x))

The numpy solution is probably faster and might be preferable if your df is considerably large.

Upvotes: 0

sammywemmy

Reputation: 28644

pyjanitor has a case_when implementation in dev that could be helpful in this case. The implementation idea is inspired by if_else in pydatatable and fcase in R's data.table; under the hood, it uses pd.Series.mask:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn

df.case_when(
    df.age.lt(10), 'baby',  # 1st condition, result
    df.age.between(10, 20, 'left'), 'kid',  # 2nd condition, result
    df.age.between(20, 30, 'left'), 'young',  # 3rd condition, result
    df.age.between(30, 50, 'left'), 'mature',  # 4th condition, result
    'grandpa',  # default if none of the conditions match
    column_name='elderly')  # column name to assign to

    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   mature
1  Molly   52            24             94  grandpa
2   Tina   36            31             57   mature
3   Jake   24             2             62    young
4    Amy   73             3             70  grandpa

Alby's solution is more efficient for this use case than an if/else construct.

Upvotes: 3

Alby

Reputation: 5742

np.select is great because it is a general way to pick values from choicelist depending on conditions.

However, for the particular problem the OP is trying to solve, there is a more succinct way to achieve the same result with pandas' cut method.


bin_cond = [-np.inf, 10, 20, 30, 50, np.inf]            # think of them as bin edges
bin_lab = ["baby", "kid", "young", "mature", "grandpa"] # the length needs to be len(bin_cond) - 1
# right=False makes each bin [low, high), matching "age >= 10 and age < 20" in the question
df["elderly2"] = pd.cut(df["age"], bins=bin_cond, labels=bin_lab, right=False)

#     name  age  preTestScore  postTestScore  elderly elderly2
# 0  Jason   42             4             25   mature   mature
# 1  Molly   52            24             94  grandpa  grandpa
# 2   Tina   36            31             57   mature   mature
# 3   Jake   24             2             62    young    young
# 4    Amy   73             3             70  grandpa  grandpa
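
A quick boundary check may help here, since the sample data has no ages sitting exactly on a bin edge. The sketch below uses a hypothetical edge_ages series (not part of the original data) to show how the right parameter of pd.cut decides which bin an edge value falls into. By default the intervals are closed on the right, so right=False is what matches the half-open ranges in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on (or next to) the bin edges from the question.
edge_ages = pd.Series([9, 10, 20, 30, 50])

bins = [-np.inf, 10, 20, 30, 50, np.inf]
labels = ["baby", "kid", "young", "mature", "grandpa"]

# Default right=True: intervals are (low, high], so 10 is still "baby".
right_closed = pd.cut(edge_ages, bins=bins, labels=labels)

# right=False: intervals are [low, high), so 10 becomes "kid" and 50 "grandpa".
left_closed = pd.cut(edge_ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age": edge_ages,
                    "right=True": right_closed,
                    "right=False": left_closed}))
```

For the five ages in the question both settings give the same labels, so the difference only shows up once an age equals one of the edges.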

Upvotes: 12

Alex

Reputation: 7045

You want to use np.select:

conditions = [
    (df["age"].lt(10)),
    (df["age"].ge(10) & df["age"].lt(20)),
    (df["age"].ge(20) & df["age"].lt(30)),
    (df["age"].ge(30) & df["age"].lt(50)),
    (df["age"].ge(50)),
]
choices = ["baby", "kid", "young", "mature", "grandpa"]

df["elderly"] = np.select(conditions, choices)

# Results in:
#      name  age  preTestScore  postTestScore  elderly
#  0  Jason   42             4             25   mature
#  1  Molly   52            24             94  grandpa
#  2   Tina   36            31             57   mature
#  3   Jake   24             2             62    young
#  4    Amy   73             3             70  grandpa

The conditions and choices lists must be the same length.
There is also a default parameter that is used when all conditions evaluate to False.
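
A minimal sketch of how default and the order of conditions interact, using a hypothetical ages series with a missing value (not part of the original data). np.select takes the first condition that is True for each element, so the upper bounds alone are enough here, much like case_when in R. The "unknown" label is an assumption for illustration. Passing a string default also keeps it the same type as the string choices, which, as far as I know, newer NumPy releases may insist on instead of accepting the implicit default of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a missing value that matches no condition.
ages = pd.Series([5, 15, 42, 73, np.nan])

# Conditions are checked in order and the first True one wins,
# so each branch only needs its upper bound.
conditions = [ages.lt(10), ages.lt(20), ages.lt(30), ages.lt(50), ages.ge(50)]
choices = ["baby", "kid", "young", "mature", "grandpa"]

# NaN compares False in every condition, so it falls through to `default`.
labels = np.select(conditions, choices, default="unknown")
print(labels)
```

Without an explicit default, rows that match nothing get 0, which is easy to miss in a column of strings.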

Upvotes: 43
