Jaskeil
Jaskeil

Reputation: 1232

Issue with case_when statement using & in dplyr?

I am attempting to create an additional column in my dataset that bucketizes percentiles. Ideally I would to create logic as follows:

CASE 
  WHEN  percentile >=  75 AND  percentile < 90  THEN "75%-89% Percentile"
  WHEN percentile >=  50 AND  percentile < 75  THEN "50%-75% Percentile"

END 

What I have attempted dplyr is the following:

  mutate(Bucket = case_when(as.double(percentile) >= 90 ~ "90%-100% Percentile",
                            as.double(percentile) >=  75 & as.double(percentile) < 90  ~ "75%-89% Percentile",
                            as.double(percentile) <  75 & as.double(percentile) >= 50  ~ "50%-75% Percentile",
                            as.double(percentile) <  50 & as.double(percentile) >= 25  ~ "25%-50% Percentile",
                            as.double(percentile) <  25 & as.double(percentile) >= 0  ~ "0%-25% Percentile"))

However it is not bucketizing correctly, see a sample of the results in the screenshot. The bucket flag for these percentiles should be "75%-89% Percentile" :

enter image description here

Upvotes: 1

Views: 142

Answers (1)

akrun
akrun

Reputation: 887551

The column percentile is factor. We need to convert to character class first and then to numeric

library(dplyr)
 df1 %>%
     mutate(percentile = as.numeric(as.character(percentile))) %>%
     ...

What happens is that when we directly coerce to numeric/integer, it gets coerced to integer storage values instead of the actual values

v1 <- factor(c(81.9, 82.7, 81.9, 82.5))
as.numeric(v1)
#[1] 1 3 1 2

is different than the following

as.numeric(as.character(v1))
#[1] 81.9 82.7 81.9 82.5

Or probably faster with levels

as.numeric(levels(v1)[v1])
#[1] 81.9 82.7 81.9 82.5

Upvotes: 3

Related Questions