Reputation: 13
I want to classify a variable based on predefined thresholds as follows:
library(tidyverse)
df <- tibble(values = sample(1:50))
classes <- c("A","B","C","D")
upper <- c(10,19,34,50)
lower <- c(0, head(upper, -1))  # each class starts where the previous one ends
segment <- df %>%
  mutate(
    class = case_when(
      values >= lower[1] & values < upper[1] ~ classes[1],
      values >= lower[2] & values < upper[2] ~ classes[2],
      values >= lower[3] & values < upper[3] ~ classes[3],
      values >= lower[4] & values < upper[4] ~ classes[4]
    )
  )
A new variable class is generated, which takes the class names defined in classes. At the moment, case_when is hardcoded for each separate entry of classes. This is fine as long as the number of classes remains small, but if I want to increase the number of classes, the hardcoded solution becomes impractical. Is it possible to incorporate purrr::map within case_when to handle this?
The following implementation did not work:
segment <- df %>%
  mutate(
    class = case_when(
      purrr::map(values >= lower & values < upper ~ classes)
    )
  )
Upvotes: 0
Views: 413
Reputation: 5898
A non-equi join in data.table is probably the fastest solution in R:
library(tidyverse)
library(data.table)
df <- tibble(values = sample(1:50))
classes <- c("A","B","C","D")
upper <- c(10,19,34,50)
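lower <- c(0, head(upper, -1))
setDT(df)  # convert the tibble to a data.table by reference
interval_lookup <- data.table(classes, upper, lower)
# Non-equi update join: assign the class whose [lower, upper) interval contains each value
df[interval_lookup, classes := i.classes, on = c("values >= lower", "values < upper")]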
df
#> values classes
#> 1: 11 B
#> 2: 31 C
#> 3: 12 B
#> 4: 6 A
#> 5: 29 C
#> 6: 38 D
#> 7: 45 D
#> 8: 28 C
#> 9: 10 B
#> 10: 3 A
#> 11: 15 B
#> 12: 43 D
#> 13: 37 D
#> 14: 14 B
#> 15: 36 D
#> 16: 33 C
#> 17: 27 C
#> 18: 8 A
#> 19: 26 C
#> 20: 47 D
#> 21: 9 A
#> 22: 39 D
#> 23: 22 C
#> 24: 49 D
#> 25: 34 D
#> 26: 23 C
#> 27: 42 D
#> 28: 4 A
#> 29: 32 C
#> 30: 20 C
#> 31: 40 D
#> 32: 21 C
#> 33: 17 B
#> 34: 16 B
#> 35: 30 C
#> 36: 46 D
#> 37: 25 C
#> 38: 24 C
#> 39: 5 A
#> 40: 44 D
#> 41: 41 D
#> 42: 50 <NA>
#> 43: 18 B
#> 44: 1 A
#> 45: 48 D
#> 46: 7 A
#> 47: 19 C
#> 48: 2 A
#> 49: 35 D
#> 50: 13 B
#> values classes
Created on 2021-01-13 by the reprex package (v0.3.0)
Upvotes: 0
Reputation: 887223
We can also use findInterval:
df$class <- c("A", "B", "C", "D")[findInterval(df$values, c(0, 10, 19, 34, 50))]
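The behaviour at the edges of the breaks is worth checking with this approach. Below is a minimal sketch, assuming the breaks are built from the question's upper vector. The sample values and the names breaks, idx_open, idx_closed, class_open and class_closed are made up for illustration.

```r
classes <- c("A", "B", "C", "D")
upper   <- c(10, 19, 34, 50)
breaks  <- c(0, upper)

# Hypothetical sample values sitting on or next to each break
values <- c(1, 9, 10, 18, 19, 33, 34, 49, 50)

# Default: intervals are [breaks[i], breaks[i + 1]), so 50 gets index 5
idx_open   <- findInterval(values, breaks)
class_open <- classes[idx_open]      # index 5 has no label, so 50 becomes NA

# rightmost.closed = TRUE turns the last interval into [34, 50]
idx_closed   <- findInterval(values, breaks, rightmost.closed = TRUE)
class_closed <- classes[idx_closed]  # 50 is now labelled "D"
```

The default matches the question's values < upper rule, including the NA for 50. Setting rightmost.closed = TRUE is a deliberate deviation from that rule. The sketch also assumes no value lies below the first break: such values get index 0, which R silently drops when subsetting, so the result would be shorter than the input.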
Upvotes: 0
Reputation: 2301
It seems like you could just use the cut function:
breaks <- c(0,10,19,34,50)
labels <- c("A","B","C","D")
df$class <- cut(df$values, breaks = breaks, labels = labels)
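Note that cut uses right-closed intervals by default, i.e. (lower, upper], whereas the question's conditions describe [lower, upper). A small sketch of the difference, using made-up sample values that sit exactly on the breaks (the names right_closed, left_closed and left_closed_all are only for illustration):

```r
breaks <- c(0, 10, 19, 34, 50)
labels <- c("A", "B", "C", "D")

# Hypothetical sample values: 1 plus every upper threshold
values <- c(1, 10, 19, 34, 50)

# Default (right = TRUE): intervals are (0,10], (10,19], (19,34], (34,50]
right_closed <- cut(values, breaks = breaks, labels = labels)

# right = FALSE: intervals are [0,10), [10,19), [19,34), [34,50),
# which matches values >= lower & values < upper; 50 falls outside
left_closed <- cut(values, breaks = breaks, labels = labels, right = FALSE)

# include.lowest = TRUE with right = FALSE also includes the highest break
left_closed_all <- cut(values, breaks = breaks, labels = labels,
                       right = FALSE, include.lowest = TRUE)
```

With the default, a value of 10 is labelled A. With right = FALSE it is labelled B, as in the question's case_when. In both cases the result is a factor, so wrap it in as.character() if a character column is needed.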
Upvotes: 1