M.Qasim
M.Qasim

Reputation: 1878

calculating n quantiles by group in tidyverse

I have a unique problem where I would like to add a column of percentiles for each group in a data frame. Here is how my data look like:

library(tidyverse)
set.seed(123)

df <- tibble(id = 1:100, 
                 group = rep(letters[1:4], 25), 
                 x = c(sample(1:100, 25, replace = T),
                       sample(101:200, 25, replace = T),
                       sample(201:300, 25, replace = T),
                       sample(301:400, 25, replace = T)))

> df
# A tibble: 100 x 3
      id group     x
   <int> <chr> <int>
 1     1 a        78
 2     2 b        80
 3     3 c         7
 4     4 d       100
 5     5 a        45
 6     6 b        76
 7     7 c        25
 8     8 d        91
 9     9 a        13
10    10 b        84
# ... with 90 more rows

# Function to create a table ten percentiles for a numeric vector
percentiles_table <- function(x) {
  res <- round(quantile(x, probs = seq(from=.1, to=1, by=0.1)), 0)
  res <- data.frame(percentile = names(res), to = res )
  res <- res %>% 
    mutate(from = lag(to, default = 0)) %>% 
    select(from,to,percentile)
}

# Table of percentiles
percentiles <- df %>% 
  group_by(group) %>% 
  summarise(percentiles_table(x)) %>% 
  ungroup()

> percentiles
# A tibble: 40 x 4
   group  from    to percentile
   <chr> <dbl> <dbl> <chr>     
 1 a         0    25 10%       
 2 a        25    71 20%       
 3 a        71   106 30%       
 4 a       106   125 40%       
 5 a       125   198 50%       
 6 a       198   236 60%       
 7 a       236   278 70%       
 8 a       278   325 80%       
 9 a       325   379 90%       
10 a       379   389 100%   

I would like to add the percentile column to df for each group where the value of x falls between from and to.

There might be some way to calculate the percentile column directly without having it calculated in a separated data.frame and then appending it back to df.

Upvotes: 0

Views: 672

Answers (4)

dash2
dash2

Reputation: 2262

A one-liner with my santoku package:

library(santoku)
df |> 
  group_by(group) |> 
  mutate(
    percentile = chop_quantiles(x, 0:100/100, 
                                labels = lbl_endpoint())
  )
# A tibble: 100 × 4
# Groups:   group [4]
      id group     x percentile
   <int> <chr> <int> <fct>     
 1     1 a        35 8%        
 2     2 b        97 20%       
 3     3 c        39 4%        
 4     4 d        20 8%        
 5     5 a        89 16%   
...

Upvotes: 2

M.Qasim
M.Qasim

Reputation: 1878

I got this working solution.

percentile_ranks <- function(x) { 
  res <- trunc(rank(x))/length(x) * 100
  res <- floor(res/10) }

df <- df %>% 
  group_by(group) %>% 
  arrange(x) %>% 
  mutate(percentile = percentile_ranks(x)) %>%
  mutate(percentile_pct = paste0(percentile*10,"%")) %>% 
  ungroup() %>% 
  arrange(id) # original data.frame order

Upvotes: 0

det
det

Reputation: 5232

Using data.table:

setDT(df)[
  , 
  percentile := cut(
    x, 
    quantile(x, seq(0, 1, 0.1)),
    include.lowest = TRUE,
    labels = paste0(seq(10, 100, 10), "%")
  ), 
  by = group
]

Upvotes: 2

install.packages("zoo")

library(zoo)

y=as.data.frame(c(0:max(percentiles$to)))

y=merge(y,unique(percentiles[,c(1)]))

y=merge(y,percentiles[,c(1,2,4)], by.x = c("group","c(0:max(percentiles$to))"), by.y = c("group","from"), all.x = TRUE)

y=na.locf(y)

df=merge(df,y, all.x = TRUE, by.x = c("group","x"), by.y = c("group","c(0:max(percentiles$to))"))

Upvotes: 1

Related Questions