Reputation: 1571
How I can implement the case_when function of R in a python code?
Here is the case_when function of R:
https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/case_when
as a minimum working example suppose we have the following dataframe (python code follows):
import pandas as pd
import numpy as np
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df
Suppose than we want to create an new column called 'elderly' that looks at the 'age' column and does the following:
if age < 10 then baby
if age >= 10 and age < 20 then kid
if age >=20 and age < 30 then young
if age >= 30 and age < 50 then mature
if age >= 50 then grandpa
Can someone help on this ?
Upvotes: 24
Views: 13143
Reputation: 1
Also for future reference: the python siuba library/package implements case_when in a similar fashion to case_when from dplyr for those R tidyverse users out there (like me).
(df
>> mutate(elderly = case_when({_.age<10 : "baby",
_.age<20 : "kid",
_.age<30 : "young",
_.age<50 : "mature",
_.age >= 50 : "grandpa"})))
Upvotes: 0
Reputation: 451
Pandas has just released a case_when
method.
caselist=[(df.age.lt(10), 'baby'),
(df.age.ge(10) & df.age.lt(20), 'kid'),
(df.age.ge(20) & df.age.lt(30), 'young'),
(df.age.ge(30) & df.age.lt(50), 'mature'),
(df.age.ge(50), 'grandpa')]
df.assign(elderly = df.age.case_when(caselist=caselist))
name age preTestScore postTestScore elderly
0 Jason 42 4 25 mature
1 Molly 52 24 94 grandpa
2 Tina 36 31 57 mature
3 Jake 24 2 62 young
4 Amy 73 3 70 grandpa
Upvotes: 6
Reputation: 169
Just for Future reference, Nowadays you could use pandas cut or map with moderate to good speeed. If you need something faster It might not suit your needs, but is good enough for daily use and batches.
import pandas as pd
If you wanna choose map or apply mount your ranges and return something if in range
def calc_grade(age):
if 50 < age < 200:
return 'Grandpa'
elif 30 <= age <=50:
return 'Mature'
elif 20 <= age < 30:
return 'Young'
elif 10 <= age < 20:
return 'Kid'
elif age < 10:
return 'Baby'
%timeit df['elderly'] = df['age'].map(calc_grade)
name | age | preTestScore | postTestScore | elderly | |
---|---|---|---|---|---|
0 | Jason | 42 | 4 | 25 | Mature |
1 | Molly | 52 | 24 | 94 | Grandpa |
2 | Tina | 36 | 31 | 57 | Mature |
3 | Jake | 24 | 2 | 62 | Young |
4 | Amy | 73 | 3 | 70 | Grandpa |
393 µs ± 8.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you wanna choose cut there should be many options. One approach - We includes to left, exclude to the right . To each bin, one label.
bins = [0, 10, 20, 30, 50, 200] #200 year Vampires are people I guess...you could change to a date you belieave plausible.
labels = ['Baby','Kid','Young', 'Mature','Grandpa']
%timeit df['elderly'] = pd.cut(x=df.age, bins=bins, labels=labels , include_lowest=True, right=False, ordered=False)
name | age | preTestScore | postTestScore | elderly | |
---|---|---|---|---|---|
0 | Jason | 42 | 4 | 25 | Mature |
1 | Molly | 52 | 24 | 94 | Grandpa |
2 | Tina | 36 | 31 | 57 | Mature |
3 | Jake | 24 | 2 | 62 | Young |
4 | Amy | 73 | 3 | 70 | Grandpa |
Upvotes: 1
Reputation: 43
Steady of numpy you can create a function and use map or apply with lambda:
def elderly_function(age):
if age < 10:
return 'baby'
if age < 20:
return 'kid'
if age < 30
return 'young'
if age < 50:
return 'mature'
if age >= 50:
return 'grandpa'
df["elderly"] = df["age"].map(lambda x: elderly_function(x))
# Works with apply as well:
df["elderly"] = df["age"].apply(lambda x: elderly_function(x))
The solution with numpy is probably fast and might be preferable if your df is considerably large.
Upvotes: 0
Reputation: 28644
pyjanitor has a case_when implementation in dev
that could be helpful in this case, the implementation idea is inspired by if_else in pydatatable
and fcase in R's data.table
; under the hood, it uses pd.Series.mask:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn
df.case_when(
df.age.lt(10), 'baby', # 1st condition, result
df.age.between(10, 20, 'left'), 'kid', # 2nd condition, result
df.age.between(20, 30, 'left'), 'young', # 3rd condition, result
df.age.between(30, 50, 'left'), 'mature', # 4th condition, result
'grandpa', # default if none of the conditions match
column_name = 'elderly') # column name to assign to
name age preTestScore postTestScore elderly
0 Jason 42 4 25 mature
1 Molly 52 24 94 grandpa
2 Tina 36 31 57 mature
3 Jake 24 2 62 young
4 Amy 73 3 70 grandpa
Alby's solution is more efficient for this use case than an if/else construct.
Upvotes: 3
Reputation: 5742
np.select
is great because it's a general way to assign values to elements in choicelist depending on conditions.
However, for the particular problem OP tries to solve, there is a succinct way to achieve the same with the pandas' cut
method.
bin_cond = [-np.inf, 10, 20, 30, 50, np.inf] # think of them as bin edges
bin_lab = ["baby", "kid", "young", "mature", "grandpa"] # the length needs to be len(bin_cond) - 1
df["elderly2"] = pd.cut(df["age"], bins=bin_cond, labels=bin_lab)
# name age preTestScore postTestScore elderly elderly2
# 0 Jason 42 4 25 mature mature
# 1 Molly 52 24 94 grandpa grandpa
# 2 Tina 36 31 57 mature mature
# 3 Jake 24 2 62 young young
# 4 Amy 73 3 70 grandpa grandpa
Upvotes: 12
Reputation: 7045
You want to use np.select
:
conditions = [
(df["age"].lt(10)),
(df["age"].ge(10) & df["age"].lt(20)),
(df["age"].ge(20) & df["age"].lt(30)),
(df["age"].ge(30) & df["age"].lt(50)),
(df["age"].ge(50)),
]
choices = ["baby", "kid", "young", "mature", "grandpa"]
df["elderly"] = np.select(conditions, choices)
# Results in:
# name age preTestScore postTestScore elderly
# 0 Jason 42 4 25 mature
# 1 Molly 52 24 94 grandpa
# 2 Tina 36 31 57 mature
# 3 Jake 24 2 62 young
# 4 Amy 73 3 70 grandpa
The conditions
and choices
lists must be the same length.
There is also a default
parameter that is used when all conditions
evaluate to False
.
Upvotes: 43