Reputation: 14774
Some time ago they introduced a nice SQL-like alternative to ifelse
within dplyr
, i.e. case_when
.
Is there an equivalent in data.table
that would allow you to specify different conditions within one []
statement, without loading additional packages?
Example:
library(dplyr)
df <- data.frame(a = c("a", "b", "a"), b = c("b", "a", "a"))
df <- df %>% mutate(
new = case_when(
a == "a" & b == "b" ~ "c",
a == "b" & b == "a" ~ "d",
TRUE ~ "e")
)
a b new
1 a b c
2 b a d
3 a a e
It would certainly be very helpful and make code much more readable (one of the reasons why I keep using dplyr
in these cases).
Upvotes: 26
Views: 14878
Reputation: 499
FYI, a more recent answer for those coming across this post 2019. data.table
versions above 1.13.0 have the fcase
function that can be used. Note that it is not a drop-in replacement for dplyr::case_when
as the syntax is different, but will be a "native" data.table
way of calculation.
# Lazy evaluation
x = 1:10
data.table::fcase(
x < 5L, 1L,
x >= 5L, 3L,
x == 5L, stop("provided value is an unexpected one!")
)
# [1] 1 1 1 1 3 3 3 3 3 3
dplyr::case_when(
x < 5L ~ 1L,
x >= 5L ~ 3L,
x == 5L ~ stop("provided value is an unexpected one!")
)
# Error in eval_tidy(pair$rhs, env = default_env) :
# provided value is an unexpected one!
# Benchmark
x = sample(1:100, 3e7, replace = TRUE) # 114 MB
microbenchmark::microbenchmark(
dplyr::case_when(
x < 10L ~ 0L,
x < 20L ~ 10L,
x < 30L ~ 20L,
x < 40L ~ 30L,
x < 50L ~ 40L,
x < 60L ~ 50L,
x > 60L ~ 60L
),
data.table::fcase(
x < 10L, 0L,
x < 20L, 10L,
x < 30L, 20L,
x < 40L, 30L,
x < 50L, 40L,
x < 60L, 50L,
x > 60L, 60L
),
times = 5L,
unit = "s")
# Unit: seconds
# expr min lq mean median uq max neval
# dplyr::case_when 11.57 11.71 12.22 11.82 12.00 14.02 5
# data.table::fcase 1.49 1.55 1.67 1.71 1.73 1.86 5
Source, data.table NEWS for 1.13.0, released (24 Jul 2020).
Upvotes: 43
Reputation: 50738
This is not really an answer, but a bit too long for a comment. If deemed inappropriate I'm happy to remove the post.
There exists an interesting post on RStudio Community that discusses options to use dplyr::case_when
without the usual tidyverse
dependencies.
To summarise, three alternatives seem to exist:
case_when
from dplyr
and build a new package lest
that depends only on base
.noplyr
, which "provides basic dplyr
and tidyr
functionality without the tidyverse dependencies".freebase
, a "A 'usethis'-like Package for Base R Pseudo-equivalents of 'tidyverse' Code", which might also be worth checking out. If it is only case_when
that you're after, I imagine lest
might be an attractive & minimal option in combination with data.table
.
Tyson Barrett recently made the package tidyfast
available (currently as version 0.1.0
) on GitHub, which provides function "dt_case_when
for dplyr::case_when()
syntax with the speed of data.table::fifelse()
".
There is also dtplyr, authored by Lionel Henry and maintained by Hadley Wickham, which "provides a data.table
backend for dplyr
. The goal of dtplyr
is to allow you to write dplyr
code that is automatically translated to the equivalent, but usually much faster, data.table
code.".
Upvotes: 22
Reputation: 47350
Here is a variation on @g-grothendieck's answer that works for non exclusive conditions :
DT[, new := c("c", "d", "e")[
apply(cbind(
a == "a" & b == "b",
a == "b" & b == "a",
TRUE), 1, which.max)]
]
DT
# a b new
# 1: a b c
# 2: b a d
# 3: a a e
Upvotes: 4
Reputation: 270428
1) If the conditions are mutually exclusive with a default if all conditions are false then this works:
library(data.table)
DT <- as.data.table(df) # df is from question
DT[, new := c("e", "c", "d")[1 +
1 * (a == "a" & b == "b") +
2 * (a == "b" & b == "a")]
]
giving:
> DT
a b new
1: a b c
2: b a d
3: a a e
2) If the results of the conditions are numeric then it is even easier. For example suppose instead of c
and d
we want 10 and 17 with a default of 3. Then:
library(data.table)
DT <- as.data.table(df) # df is from question
DT[, new := 3 +
(10 - 3) * (a == "a" & b == "b") +
(17 - 3) * (a == "b" & b == "a")]
3) Note that adding a 1-liner is sufficient to implement this. It assumes that there is at least one TRUE leg for each row.
when <- function(...) names(match.call()[-1])[apply(cbind(...), 1, which.max)]
# test
DT[, new := when(c = a == 'a' & b == 'b',
d = a == 'b' & b == 'a',
e = TRUE)]
Upvotes: 22