hosoft

Reputation: 21

R quantile on frequency values

I'd like to get quantiles on frequency values. For example, suppose I have data like the following:

length frequency
1      13    # There are 13 length 1 items.
2      20    # There are 20 length 2 items.
8      17
10     25
...
[10000+ more entries in file]

So I'd like to get quantiles for certain probabilities such as 0.05, 0.10, 0.50, 0.90, 0.95, 0.99. Also, I'd like to get the rank of a certain length. How can I do that in R or in Python?

Upvotes: 1

Views: 1981

Answers (2)

mathematical.coffee

Reputation: 56915

If you want a base-R solution, try this. It works the same way as @jeremycg's dplyr solution: calculate the cumulative relative frequency for each length, then, for a given quantile, find the first length whose cumulative frequency is >= that quantile.

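dta <- data.frame(length=c(1,2,8,10), frequency=c(13,20,17,25))
dta <- dta[order(dta$length), ] # sort by length so the cumulative sum is meaningful
dta$cumfreq <- cumsum(dta$frequency)/sum(dta$frequency) # cumulative relative frequency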

qtle <- 0.5 # quantile to find
dta$length[dta$cumfreq >= qtle][1] # in a tie, picks the lower length
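
The same lookup extends to all the probabilities listed in the question at once. Below is a minimal sketch, assuming the data frame is sorted by length; `quantile_from_freq` and `probs` are hypothetical names introduced here for illustration, not part of the original answer.

```r
# Example data from the question: lengths and how often each occurs.
dta <- data.frame(length = c(1, 2, 8, 10), frequency = c(13, 20, 17, 25))
dta <- dta[order(dta$length), ]   # cumulative sums assume sorted lengths
dta$cumfreq <- cumsum(dta$frequency) / sum(dta$frequency)

# Hypothetical helper: first length whose cumulative proportion reaches p.
quantile_from_freq <- function(p, lengths, cumfreq) {
  lengths[which(cumfreq >= p)[1]]
}

probs <- c(0.05, 0.10, 0.50, 0.90, 0.95, 0.99)
result <- sapply(probs, quantile_from_freq,
                 lengths = dta$length, cumfreq = dta$cumfreq)
names(result) <- probs
result
```

When the expanded data fits in memory, `quantile(rep(dta$length, dta$frequency), probs, type = 1)` should agree with this, since type 1 is the inverse of the empirical distribution function.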

To rank the lengths by frequency, see ?rank:

rank(dta$frequency) # ranks frequencies, increasing
rank(-dta$frequency) # rank decreasing
rank(-dta$frequency)[dta$length == 8] # rank of length 8: 3rd most common
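
If "rank of a certain length" in the question instead means where a length falls in the overall distribution, rather than how common it is, one possible reading is its cumulative proportion (a percentile rank). This is only a sketch under that assumption; `percentile_rank` is a hypothetical helper name, not a base-R function.

```r
# Example data from the question.
dta <- data.frame(length = c(1, 2, 8, 10), frequency = c(13, 20, 17, 25))

# Hypothetical helper: share of all items whose length is <= len.
percentile_rank <- function(len, lengths, frequency) {
  sum(frequency[lengths <= len]) / sum(frequency)
}

# 13 + 20 + 17 = 50 of the 75 items have length <= 8.
percentile_rank(8, dta$length, dta$frequency)
```

Because the helper compares against `len` directly, it also accepts lengths that do not appear in the table, such as 5.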

Upvotes: 1

jeremycg

Reputation: 24945

Using dplyr, first create a column with cumulative proportion:

library(dplyr)
# dta is a data frame with columns length and frequency
dta1 <- dta %>% arrange(length) %>%
      mutate(quartile = cumsum(frequency / sum(frequency))) # cumulative proportion, despite the name

Now we can simply find the first row whose cumulative proportion is greater than the required quantile (in this case 0.5):

dta1 %>% filter(quartile > 0.5) %>%
        slice(1)

NB this quantile finder is particularly dumb; it's up to you to fix it for ties etc.
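
To see what the tie remark means in practice, here is a small sketch with made-up frequencies where the cumulative proportion hits 0.5 exactly. It is written in base R so that it runs without extra packages; the `tied` data frame is an illustrative assumption, not data from the question.

```r
# Made-up data: four equally frequent lengths,
# so the cumulative proportions are 0.25, 0.5, 0.75, 1.
tied <- data.frame(length = c(1, 2, 8, 10), frequency = c(25, 25, 25, 25))
cumprop <- cumsum(tied$frequency) / sum(tied$frequency)

strict    <- tied$length[cumprop >  0.5][1]  # strict comparison skips the exact hit
inclusive <- tied$length[cumprop >= 0.5][1]  # inclusive comparison keeps it
c(strict = strict, inclusive = inclusive)
```

The same choice applies to the `filter()` call: `>` and `>=` differ only when a cumulative proportion equals the requested value exactly.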

Upvotes: 3
