Reputation: 818
I'm trying to create a variable with the "dplyr" command mutate that indicates the quantile of another variable.
For example:
# 1. Fake data:
library(dplyr)
data <- data.frame(
  "id" = 1:20,
  "score" = round(rnorm(20, 30, 20)))
# 2. Creating variable 'Quntile_5'
data <- data %>%
  mutate(Quntile_5 = ????)
So far I have created a function that identifies the quantile and returns it as a factor, and it actually works:
# 3. Create a function:
quantile5 <- function(x){
  x = ifelse(
    x < quantile(x,0.2),1,
    ifelse(x >= quantile(x,0.2) & x < quantile(x,0.4),2,
    ifelse(x >= quantile(x,0.4) & x < quantile(x,0.6),3,
    ifelse(x >= quantile(x,0.6) & x < quantile(x,0.8),4,5
  ))))
  return(as.factor(x))
}
# 4. Running the code:
data <- data %>%
  mutate(Quntile_5 = quantile5(score))
# 5. Result:
data
   id score Quntile_5
1   1    55         5
2   2    56         5
3   3    26         3
4   4    42         3
5   5    41         3
6   6    26         3
7   7    57         5
8   8    12         1
9   9    21         2
10 10    25         2
11 11    37         3
12 12    18         2
13 13    54         5
14 14    47         4
15 15    52         4
16 16    -4         1
17 17    53         4
18 18    51         4
19 19    -7         1
20 20    -2         1
But if I want to create, for example, a variable "Quantile_100" as a factor indicating which of the positions 1 to 100 each observation falls in (in the context of larger data sets), this is not a great solution. Is there an easier way to create these quantile variables?
Upvotes: 3
Views: 4467
Reputation: 389325
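Here are two options with cut:
1. A single number as breaks, which splits the range of score into 100 equal-width intervals:
library(dplyr)
data %>% mutate(quantile100 = cut(score, 100, labels = FALSE))
2. Breaks at the percentiles of score. This is similar to @Anoushiravan R's `findInterval` approach:
# include.lowest = TRUE keeps the minimum score in the first bin instead of returning NA
data %>%
  mutate(quantile100 = cut(score, unique(quantile(score, seq(0, 1, 0.01))),
                           labels = FALSE, include.lowest = TRUE))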
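To make the difference between the two calls concrete, here is a minimal sketch using the 20 scores printed in the question. The names `equal_width`, `by_quantile` and `without_lowest` are only illustrative. As far as I can tell, `cut(score, 100)` gives equal-width intervals rather than percentiles, and quantile breaks need `include.lowest = TRUE` so that the minimum does not end up as `NA`.

```r
# The 20 scores from the question's printed result
score <- c(55, 56, 26, 42, 41, 26, 57, 12, 21, 25,
           37, 18, 54, 47, 52, -4, 53, 51, -7, -2)

# Option 1: 100 equal-width intervals spanning range(score)
equal_width <- cut(score, 100, labels = FALSE)

# Option 2: breaks at the empirical percentiles;
# unique() drops breaks that are duplicated because of tied scores
breaks <- unique(quantile(score, seq(0, 1, 0.01)))
by_quantile <- cut(score, breaks, labels = FALSE, include.lowest = TRUE)

# Same breaks without include.lowest: the minimum score falls outside (a, b]
without_lowest <- cut(score, breaks, labels = FALSE)

# Side-by-side comparison of the three codings
comparison <- data.frame(score, equal_width, by_quantile, without_lowest)
print(comparison)
```

Because `unique()` removes tied breaks, the highest bin number in option 2 can be lower than 100 on small or heavily tied data.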
Upvotes: 6
Reputation: 21938
I hope this is what you were looking for:
library(dplyr)
data <- data.frame(
  "id" = seq(1:20),
  "score" = round(rnorm(20,30,20)))
data %>%
  mutate(quantile100 = findInterval(score, quantile(score, probs = seq(0, 1, 0.01)),
                                    rightmost.closed = TRUE)) %>%
  slice_head(n = 10)
   id score quantile100
1   1    59          95
2   2    47          90
3   3    83         100
4   4    33          53
5   5     7          11
6   6    26          43
7   7    16          16
8   8    18          27
9   9    33          53
10 10    47          90
I chose to close the rightmost bin so that the maximum category does not go beyond 100. We can also verify this with your own example, which leads to the same result:
df %>%
  mutate(quantile5 = findInterval(score, quantile(score, probs = seq(0, 1, 0.2)),
                                  rightmost.closed = TRUE)) %>%
  slice_head(n = 10)
   id score quantile5
1   1    55         5
2   2    56         5
3   3    26         3
4   4    42         3
5   5    41         3
6   6    26         3
7   7    57         5
8   8    12         1
9   9    21         2
10 10    25         2
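The same `findInterval` call can be wrapped so that the number of bins is a parameter. This is only a sketch: `quantile_bin` is a hypothetical helper name, not a dplyr or base R function, and the scores are the 20 values from the question's printed result.

```r
# Hypothetical helper (not part of dplyr or base R):
# assign each value of x to one of n quantile bins, numbered 1..n
quantile_bin <- function(x, n) {
  # n + 1 break points at the 0%, 1/n, 2/n, ..., 100% quantiles
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1))
  # rightmost.closed = TRUE keeps the maximum in bin n instead of n + 1
  findInterval(x, breaks, rightmost.closed = TRUE)
}

# The 20 scores from the question's printed result
score <- c(55, 56, 26, 42, 41, 26, 57, 12, 21, 25,
           37, 18, 54, 47, 52, -4, 53, 51, -7, -2)

bins5 <- quantile_bin(score, 5)      # quintiles
bins100 <- quantile_bin(score, 100)  # percentiles
print(data.frame(score, bins5, bins100))

# Inside a dplyr pipeline the helper would be used as, for example:
#   data %>% mutate(Quantile_100 = quantile_bin(score, 100))
```

Wrap the result in `as.factor()` if a factor is needed, as in the question's `quantile5`.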
Data
df <- structure(list(id = 1:20, score = c(55L, 56L, 26L, 42L, 41L,
26L, 57L, 12L, 21L, 25L, 37L, 18L, 54L, 47L, 52L, -4L, 53L, 51L,
-7L, -2L)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"))
Upvotes: 2