lebelinoz
lebelinoz

Reputation: 5068

How to compute quantiles on groups

I want to separate by data into quantiles, same as in this great question

The problem is that I want to do it every day in a time series:

set.seed(123)
temp.all <- data.frame(date = c(rep(Sys.Date() - 1, 12), rep(Sys.Date(), 12)), 
    name=letters[c(1:12, 1:12)], value=rnorm(24))

At the moment, I'm solving with a for-loop:

library(dplyr)
for (d in unique(temp.all$date)) {
    temp = filter(temp.all, date == d)
    temp$quartile <- with(temp, factor(
                        findInterval( val, c(-Inf,
                           quantile(val, probs=c(0.25, .5, .75)), Inf) , na.rm=TRUE), 
                        labels=c("Q1","Q2","Q3","Q4")
  ))
  # ...and doing rbind on 'temp' to reconstruct temp.all with quartiles
}

Any ideas on how to avoid the dreaded for-loop? Is there maybe a group_by way of doing this?

Upvotes: 1

Views: 2718

Answers (2)

Marius
Marius

Reputation: 60070

Using group_by you can just do:

library(lubridate)

temp.all = temp.all %>%
    # lubridate::date(date) might be necessary if you have datetimes
    group_by(date) %>%
    mutate(quartile = cut(value, breaks = 4, labels = paste0("Q", 1:4)))

dplyr also has a function ntile which should behave similarly to cut and should give the same results.

Upvotes: 4

Uwe
Uwe

Reputation: 42544

For the sake of completeness (and to correct a wrong answer), here is also a data.table solution:

library(data.table)
setDT(temp.all)[, quartile := cut(value, breaks = 4L, labels = paste0("Q", 1:4)), by = date]
temp.all
          date name       value quartile
 1: 2017-08-28    a -0.56047565       Q1
 2: 2017-08-28    b -0.23017749       Q2
 3: 2017-08-28    c  1.55870831       Q4
 4: 2017-08-28    d  0.07050839       Q2
 5: 2017-08-28    e  0.12928774       Q2
 6: 2017-08-28    f  1.71506499       Q4
...
18: 2017-08-29    f -1.96661716       Q1
19: 2017-08-29    g  0.70135590       Q3
20: 2017-08-29    h -0.47279141       Q2
21: 2017-08-29    i -1.06782371       Q1
22: 2017-08-29    j -0.21797491       Q2
23: 2017-08-29    k -1.02600445       Q2
24: 2017-08-29    l -0.72889123       Q2
          date name       value quartile

Note that the results are grouped by date as requested by the OP and not by day(date) which would aggregate the day of month, e.g., January 1st, February 1st, March 1st, etc.

Further note that only one additional column quartile is being added to temp.all in place, i.e., without copying the whole data set in order to save memory and time (which might become relevant when dealing with large data sets).

Upvotes: 1

Related Questions