Reputation: 21
I'd like to get quantiles on frequency values. For example, suppose I have data like the following:
length frequency
1 13 # There are 13 length 1 items.
2 20 # There are 20 length 2 items.
8 17
10 25
... [10000+ more entries in file]
So I'd like to get quantiles for certain probabilities like 0.05, 0.10, 0.50, 0.90, 0.95, 0.99. Also, I'd like to get the rank of a certain length. How can I do that in R or in Python?
Upvotes: 1
Views: 1981
Reputation: 56915
If you want a base-R solution, try this. It works the same way as @jeremycg's dplyr solution: calculate the cumulative relative frequency for each length, then, for a specific quantile, find the first length whose cumulative frequency is >= that quantile.
dta <- data.frame(length=c(1,2,8,10), frequency=c(13,20,17,25))
dta$cumfreq <- cumsum(dta$frequency)/sum(dta$frequency)
qtle <- 0.5 # quantile to find
dta$length[dta$cumfreq >= qtle][1] # in a tie, picks the lower length
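Since the question also asks about Python, here is a minimal sketch of the same cumulative-frequency lookup using only the standard library. The function name `weighted_quantile` and the list-of-pairs layout are made up for illustration; it assumes the data fits in memory as (length, frequency) pairs.

```python
from bisect import bisect_left
from itertools import accumulate

# Same toy data as the R example: (length, frequency) pairs.
data = [(1, 13), (2, 20), (8, 17), (10, 25)]

def weighted_quantile(pairs, q):
    """Return the first length whose cumulative proportion is >= q.

    Hypothetical helper, mirroring dta$length[dta$cumfreq >= qtle][1].
    """
    pairs = sorted(pairs)                       # order by length
    lengths = [length for length, _ in pairs]
    cum_counts = list(accumulate(freq for _, freq in pairs))
    total = cum_counts[-1]
    # cum_count / total >= q  is the same test as  cum_count >= q * total
    idx = bisect_left(cum_counts, q * total)
    return lengths[min(idx, len(lengths) - 1)]

print(weighted_quantile(data, 0.5))  # 8, matching the R result above
print([weighted_quantile(data, q) for q in (0.05, 0.10, 0.50, 0.90, 0.95, 0.99)])
```

Because the cumulative counts are sorted, the binary search keeps each lookup cheap even with the 10000+ rows mentioned in the question.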
To rank the lengths by frequency, see ?rank:
rank(dta$frequency) # ranks frequencies, increasing
rank(-dta$frequency) # rank decreasing
rank(-dta$frequency)[dta$length == 8] # rank of length 8: 3rd most common
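A rough Python counterpart to the rank lookup, again as a sketch rather than a drop-in replacement: `frequency_rank` is a hypothetical helper, and it gives tied frequencies the same (lowest) rank, whereas R's `rank` averages tied ranks by default.

```python
# Same toy data as the R example: (length, frequency) pairs.
data = [(1, 13), (2, 20), (8, 17), (10, 25)]

def frequency_rank(pairs, length):
    """Rank of `length` by descending frequency (1 = most common).

    Hypothetical helper; ties share the lowest rank ("min" ranking).
    """
    freq = dict(pairs)
    target = freq[length]          # raises KeyError if the length is absent
    # One plus the number of lengths that are strictly more frequent.
    return 1 + sum(1 for f in freq.values() if f > target)

print(frequency_rank(data, 8))  # 3, i.e. length 8 is the 3rd most common
```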
Upvotes: 1
Reputation: 24945
Using dplyr, first create a column with the cumulative proportion:
library(dplyr)
dta1 <- dta %>% arrange(length) %>%
  mutate(quartile = cumsum(frequency / sum(frequency)))
Now we can simply find the first row whose cumulative proportion is greater than the required quantile (in this case 0.5):
dta1 %>% filter(quartile > 0.5) %>%
  slice(1)
NB: this quantile finder is particularly dumb; it's up to you to fix it for ties etc.
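For the tie case, one possible convention (an assumption here, not something the answers above prescribe) is to average the two neighbouring lengths when the cumulative count lands exactly on the requested quantile, as is commonly done for the median. A Python sketch using exact fractions, so the equality test is not thrown off by floating-point rounding; `quantile_with_ties` is a hypothetical name:

```python
from fractions import Fraction
from itertools import accumulate

# Same toy data as above: (length, frequency) pairs.
data = [(1, 13), (2, 20), (8, 17), (10, 25)]

def quantile_with_ties(pairs, q):
    """First length whose cumulative count exceeds q * total.

    On an exact tie, return the midpoint of that length and the next one.
    Pass q as a Fraction or a string such as "0.44" to keep the tie test
    exact; a float like 0.44 is not exactly 11/25.
    """
    pairs = sorted(pairs)                       # order by length
    lengths = [length for length, _ in pairs]
    cum_counts = list(accumulate(freq for _, freq in pairs))
    target = Fraction(q) * cum_counts[-1]
    for i, count in enumerate(cum_counts):
        if count > target:
            return lengths[i]
        if count == target:
            nxt = lengths[min(i + 1, len(lengths) - 1)]
            return (lengths[i] + nxt) / 2
    return lengths[-1]

print(quantile_with_ties(data, Fraction(1, 2)))   # 8, no tie at the median
print(quantile_with_ties(data, "0.44"))           # 5.0, tie between 2 and 8
```

With the toy data, 33 of the 75 items have length <= 2, so 0.44 is an exact tie and the sketch returns the midpoint of 2 and 8.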
Upvotes: 3