Reputation: 1232
I am attempting to create an additional column in my dataset that bucketizes percentiles. Ideally I would to create logic as follows:
CASE
WHEN percentile >= 75 AND percentile < 90 THEN "75%-89% Percentile"
WHEN percentile >= 50 AND percentile < 75 THEN "50%-75% Percentile"
END
What I have attempted dplyr is the following:
mutate(Bucket = case_when(as.double(percentile) >= 90 ~ "90%-100% Percentile",
as.double(percentile) >= 75 & as.double(percentile) < 90 ~ "75%-89% Percentile",
as.double(percentile) < 75 & as.double(percentile) >= 50 ~ "50%-75% Percentile",
as.double(percentile) < 50 & as.double(percentile) >= 25 ~ "25%-50% Percentile",
as.double(percentile) < 25 & as.double(percentile) >= 0 ~ "0%-25% Percentile"))
However it is not bucketizing correctly, see a sample of the results in the screenshot. The bucket flag for these percentiles should be "75%-89% Percentile" :
Upvotes: 1
Views: 142
Reputation: 887551
The column percentile
is factor
. We need to convert to character
class first and then to numeric
library(dplyr)
df1 %>%
mutate(percentile = as.numeric(as.character(percentile))) %>%
...
What happens is that when we directly coerce to numeric/integer, it gets coerced to integer storage values instead of the actual values
v1 <- factor(c(81.9, 82.7, 81.9, 82.5))
as.numeric(v1)
#[1] 1 3 1 2
is different than the following
as.numeric(as.character(v1))
#[1] 81.9 82.7 81.9 82.5
Or probably faster with levels
as.numeric(levels(v1)[v1])
#[1] 81.9 82.7 81.9 82.5
Upvotes: 3