How to create percentiles in R using dplyr with data frame?

Question

I am looking to create an additional column named "percentile", the percentile will be based off the sold quotes quotes and I do not want to create a window function on it, the percentile is should be based off the entire dataset. See below, the data is currently in descending order by SOLD_QUOOTES, what ideally the first row we see in the image should be the 99.99% percentile and should lower cascading down the table.

Excepted output

Kieran · Accepted Answer

Maybe something like,


library(dplyr)

df <- tibble(sold_quotes = sample(1e6, 1e3, replace = TRUE))

pctiles <- seq(0, 1, 0.001)

df %>% 
  arrange(desc(sold_quotes)) %>% 
  mutate(percentile = cut(sold_quotes, 
                      quantile(sold_quotes, 
                               probs = pctiles), 
                      labels = pctiles[2:length(pctiles)]*100)) 
#> # A tibble: 1,000 x 2
#>    sold_quotes percentile
#>                
#>  1      999562 100       
#>  2      996533 99.9      
#>  3      996260 99.8      
#>  4      995499 99.7      
#>  5      994984 99.6      
#>  6      994937 99.5      
#>  7      994130 99.4      
#>  8      993001 99.3      
#>  9      992902 99.2      
#> 10      990298 99.1      
#> # … with 990 more rows

The percentile calculation doesn't depend on rearranging sold_quotes in descending order; you'll get the correct result without it. I was just mirroring your example.

How to create percentiles in R using dplyr with data frame?

Answers (1)

Related Questions