Reputation: 1232
I am looking to create an additional column named "percentile", the percentile will be based off the sold quotes quotes and I do not want to create a window function on it, the percentile is should be based off the entire dataset. See below, the data is currently in descending order by SOLD_QUOOTES, what ideally the first row we see in the image should be the 99.99% percentile and should lower cascading down the table.
Excepted output
Upvotes: 2
Views: 2249
Reputation: 1223
Maybe something like,
library(dplyr)
df <- tibble(sold_quotes = sample(1e6, 1e3, replace = TRUE))
pctiles <- seq(0, 1, 0.001)
df %>%
arrange(desc(sold_quotes)) %>%
mutate(percentile = cut(sold_quotes,
quantile(sold_quotes,
probs = pctiles),
labels = pctiles[2:length(pctiles)]*100))
#> # A tibble: 1,000 x 2
#> sold_quotes percentile
#> <int> <fct>
#> 1 999562 100
#> 2 996533 99.9
#> 3 996260 99.8
#> 4 995499 99.7
#> 5 994984 99.6
#> 6 994937 99.5
#> 7 994130 99.4
#> 8 993001 99.3
#> 9 992902 99.2
#> 10 990298 99.1
#> # … with 990 more rows
The percentile calculation doesn't depend on rearranging sold_quotes
in descending order; you'll get the correct result without it. I was just mirroring your example.
Upvotes: 5