How to create a variable with the quantiles of another one in R?

I'm trying to use the "dplyr" function mutate to create a variable that indicates which quantile of another variable each observation falls in.

For example:

library(dplyr)

# 1. Fake data:
data <- data.frame(
  "id" = seq(1:20),
  "score" = round(rnorm(20,30,20)))

# 2. Creating variable 'Quntile_5'
data <- data %>% 
  mutate(Quntile_5 = ????)

So far I have written a function that identifies the quantile and returns it as a factor, and it works:

# 3. Create a function:
quantile5 <- function(x){
  x = ifelse(
    x < quantile(x,0.2),1,
    ifelse(x >= quantile(x,0.2) & x < quantile(x,0.4),2,
           ifelse(x >= quantile(x,0.4) & x < quantile(x,0.6),3,
                  ifelse(x >= quantile(x,0.6) & x < quantile(x,0.8),4,5
                         ))))
  return(as.factor(x))
}

# 4. Running the code:
data <-data %>% 
  mutate(Quntile_5 = quantile5(score))

# 5. Result:
data

   id score Quntile_5
1   1    55         5
2   2    56         5
3   3    26         3
4   4    42         3
5   5    41         3
6   6    26         3
7   7    57         5
8   8    12         1
9   9    21         2
10 10    25         2
11 11    37         3
12 12    18         2
13 13    54         5
14 14    47         4
15 15    52         4
16 16    -4         1
17 17    53         4
18 18    51         4
19 19    -7         1
20 20    -2         1

But this does not scale. If I want, for example, a variable "Quantile_100" as a factor indicating which position from 1 to 100 each observation falls in (for larger data sets), nesting 100 ifelse calls is not a great solution. Is there an easier way to create these quantile variables?

Upvotes: 3

Views: 4467

Answers (2)

Ronak Shah

Reputation: 389325

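Here are two options with `cut`:

1. Split the range of `score` into 100 equal-width bins:

library(dplyr)

data %>% mutate(quantile100 = cut(score, 100, labels = FALSE))

2. Use the percentiles of `score` as breaks. This is similar to @Anoushiravan R's `findInterval` approach. `include.lowest = TRUE` keeps the minimum in the first bin instead of returning NA for it:

data %>% 
    mutate(quantile100 = cut(score, unique(quantile(score, seq(0, 1, 0.01))), 
                             labels = FALSE, include.lowest = TRUE))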
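The two `cut` calls do different things: a single number as `breaks` gives equal-width intervals over the range of `score`, while quantile breaks give bins with roughly equal counts. Here is a minimal sketch of that difference, using 5 bins instead of 100 so the counts are easy to read. The `score` vector is copied from the result printed in the question, and the names `equal_width` and `quantile_bin` are just illustrative.

```r
# Scores copied from the result shown in the question (assumed for reproducibility)
score <- c(55, 56, 26, 42, 41, 26, 57, 12, 21, 25,
           37, 18, 54, 47, 52, -4, 53, 51, -7, -2)

# Equal-width bins: the range of `score` is cut into 5 intervals of equal length
equal_width <- cut(score, 5, labels = FALSE)

# Quantile bins: breaks are the 0%, 20%, ..., 100% quantiles of `score`.
# include.lowest = TRUE keeps the minimum in bin 1 instead of NA.
breaks <- unique(quantile(score, seq(0, 1, 0.2)))
quantile_bin <- cut(score, breaks, labels = FALSE, include.lowest = TRUE)

table(equal_width)   # counts vary a lot from bin to bin
table(quantile_bin)  # counts are roughly equal
```

Note that `cut` uses right-closed intervals by default, so a value that sits exactly on a break lands one bin lower than it would with `findInterval`. Passing `right = FALSE` switches to left-closed intervals if that convention is preferred.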

Upvotes: 6

Anoushiravan R

Reputation: 21938

I hope this is what you were looking for:

library(dplyr)

data <- data.frame(
  "id" = seq(1:20),
  "score" = round(rnorm(20,30,20)))


data %>%
  mutate(quantile100 = findInterval(score, quantile(score, probs = seq(0, 1, 0.01)), 
                                    rightmost.closed = TRUE)) %>%
  slice_head(n = 10)

   id score quantile100
1   1    59          95
2   2    47          90
3   3    83         100
4   4    33          53
5   5     7          11
6   6    26          43
7   7    16          16
8   8    18          27
9   9    33          53
10 10    47          90
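The `rightmost.closed = TRUE` argument only affects values equal to the last break, which here is the maximum. A small sketch on a made-up vector (`x` is illustrative and not from the question) shows the difference:

```r
# Illustrative vector (not from the question)
x <- c(1, 2, 3, 4, 5)
breaks <- quantile(x, probs = seq(0, 1, 0.5))  # 0%, 50%, 100% -> 1, 3, 5

# Default: intervals are [1, 3) and [3, 5); a value equal to the last
# break falls past the final interval and gets an extra index
open_right <- findInterval(x, breaks)
open_right
# 1 1 2 2 3

# rightmost.closed = TRUE: the last interval becomes [3, 5],
# so the maximum stays in the top bin
closed_right <- findInterval(x, breaks, rightmost.closed = TRUE)
closed_right
# 1 1 2 2 2
```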

I chose to close the rightmost bin so that the maximum category does not go beyond 100. We can also verify it with your own example, which leads to the same result as your function:


df %>%
  mutate(quantile5 = findInterval(score, quantile(score, probs = seq(0, 1, 0.2)), 
                                  rightmost.closed = TRUE)) %>%
  slice_head(n = 10)


   id score quantile5
1   1    55         5
2   2    56         5
3   3    26         3
4   4    42         3
5   5    41         3
6   6    26         3
7   7    57         5
8   8    12         1
9   9    21         2
10 10    25         2
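To check the agreement on all 20 rows rather than by eye, something like the following could be used. It is only a sketch: `quantile5` is repeated from the question so the block is self-contained, and `quantile_group` is a hypothetical helper name (not part of any package) that wraps the `findInterval` call for an arbitrary number of groups `n`.

```r
# quantile5() as defined in the question, repeated for self-containment
quantile5 <- function(x) {
  x <- ifelse(
    x < quantile(x, 0.2), 1,
    ifelse(x >= quantile(x, 0.2) & x < quantile(x, 0.4), 2,
           ifelse(x >= quantile(x, 0.4) & x < quantile(x, 0.6), 3,
                  ifelse(x >= quantile(x, 0.6) & x < quantile(x, 0.8), 4, 5))))
  as.factor(x)
}

# Hypothetical helper: assigns each value to one of n quantile groups
quantile_group <- function(x, n) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1))
  findInterval(x, breaks, rightmost.closed = TRUE)
}

# Scores from the question's example
score <- c(55, 56, 26, 42, 41, 26, 57, 12, 21, 25,
           37, 18, 54, 47, 52, -4, 53, 51, -7, -2)

via_ifelse <- as.integer(as.character(quantile5(score)))
via_findInterval <- quantile_group(score, 5)

all(via_ifelse == via_findInterval)
# TRUE

# The same helper covers the Quantile_100 case from the question
range(quantile_group(score, 100))
```

Inside `mutate`, this would be used as `mutate(Quantile_100 = quantile_group(score, 100))`, wrapped in `as.factor()` if a factor is needed.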

Data

df <- structure(list(id = 1:20, score = c(55L, 56L, 26L, 42L, 41L, 
26L, 57L, 12L, 21L, 25L, 37L, 18L, 54L, 47L, 52L, -4L, 53L, 51L, 
-7L, -2L)), class = "data.frame", row.names = c("1", "2", "3", 
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", 
"16", "17", "18", "19", "20"))

Upvotes: 2
