Reputation: 165

How to calculate percentile with group by?

I have a data.table with over ten thousand of rows and it looks like this:

DT1 <- data.table(ID = 1:10,
                  result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
                  result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
                  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))

    ID result_2010 result_2011 years
 1:  1        TRUE       FALSE  15.0
 2:  2       FALSE        TRUE  16.5
 3:  3        TRUE       FALSE  31.0
 4:  4       FALSE       FALSE   1.0
 5:  5       FALSE       FALSE  40.2
 6:  6        TRUE       FALSE   0.3
 7:  7       FALSE        TRUE  12.0
 8:  8       FALSE       FALSE  22.7
 9:  9        TRUE       FALSE  19.0
10: 10       FALSE        TRUE  12.0

For "result_2010" and "result_2011" I want to make a percentile analysis of the "years" but only if the value in for the individual is "TRUE". The code that I tried seems to work, but it gives back the same results for "result_2010" and "result_2011", which is for sure incorrect:

DT1 %>%
  group_by(result_2010 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))
DT1 %>%
  group_by(result_2011 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))

Could anyone help how to correct my code?

Upvotes: 5

Answers (5)

blrun

Reputation: 489

This will be my first answer, so please forgive me if I do something wrong. By reading your question carefully, you wanted someone to help you improve your code. Here it is, please.

library(tidyverse)
library(data.table)

DT1 <- data.table(ID = 1:10,
                  result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
                  result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
                  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))
DT1 %>%
  filter(result_2010 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))
DT1 %>%
  filter(result_2011 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))

In the first case, it returns the values 4.7, 11.3, 17, 22, 27.4, 16.3. In the second case, it returns 12, 12, 12, 14.2, 15.6, 13.5. I see so many different answers here. Although I honestly admit some of them I don't understand (yet). I really like the solution with quantile%>% tibble%>% bind_cols. But knock on what I have a low reputation for pointing to this as helpful.

Upvotes: 1

Marek Fiołka

Reputation: 4949

library(tidyverse)
DT1 <- tibble(ID = 1:10,
                  result_2010 = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
                  result_2011 = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
                  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))

fQuantMean = function(x) t(quantile(x)) %>% 
  as_tibble() %>% bind_cols(mean = mean(x))

tibble(
  year = c(2010, 2011),
  data = list(DT1$years[DT1$result_2010],
              DT1$years[DT1$result_2011])
) %>% group_by(year) %>% 
  group_modify(~fQuantMean(.x$data[[1]]))

output

# A tibble: 2 x 7
# Groups:   year [2]
   year  `0%` `25%` `50%` `75%` `100%`  mean
  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
1  2010   0.3  11.3    17  22     31    16.3
2  2011  12    12      12  14.2   16.5  13.5

An update for anyone interested!

Hello dear colleagues. As you can see, each task can be solved in several different ways. So I decided to compare the methods proposed by here. Since @Gabesz mentioned that he has 10000 observations, I decided to check each of the solutions in terms of performance.

n=10000
set.seed(1234)
DT1 <- tibble(ID = 1:n,
              result_2010 = sample(c(TRUE, FALSE), n, replace = TRUE),
              result_2011 = sample(c(TRUE, FALSE), n, replace = TRUE),
              years = rnorm(n, 20, 5))

Then I did a little benchmark

fQuantMean = function(x) t(quantile(x)) %>% 
  as_tibble() %>% bind_cols(mean = mean(x))

fFiolka = function(){
  tibble(
    year = c(2010, 2011),
    data = list(DT1$years[DT1$result_2010],
                DT1$years[DT1$result_2011])
  ) %>% group_by(year) %>% 
    group_modify(~fQuantMean(.x$data[[1]]))
}
fFiolka()
# # A tibble: 2 x 7
# # Groups:   year [2]
#    year     `0%` `25%` `50%` `75%` `100%`  mean
#    <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
# 1  2010 -0.00697  16.4  19.9  23.3   38.6  19.9
# 2  2011 -0.633    16.5  20.0  23.4   38.6  20.0

library(data.table)

fjay_sf = function(){
  melt(DT1, c(1, 4), 2:3) |>
    transform(variable=substring(variable, 8)) |>
    subset(value == TRUE) |>
    with(aggregate(list(q=years), list(year=variable), \(x)
                   c(quantile(x), mean=mean(x))))
}
fjay_sf()
# year         q.0%        q.25%        q.50%        q.75%       q.100%       q.mean
# 1 2010 -0.006968224 16.447077579 19.947385976 23.348571278 38.636456902 19.944574420
# 2 2011 -0.633138113 16.530534403 20.043636844 23.424378551 38.636456902 20.013130400
# Warning message:
#   In melt(DT1, c(1, 4), 2:3) :
#   The melt generic in data.table has been passed a tbl_df and will attempt to redirect 
#   to the relevant reshape2 method; please note that reshape2 is deprecated, and this 
#   redirection is now deprecated as well. To continue using melt methods from reshape2
#    while both libraries are attached, e.g. melt.list, you can prepend the namespace 
#    like reshape2::melt(DT1). In the next version, this warning will become an error.


cols <- grep('result_', names(DT1), value = TRUE)

get_stats_fun <- function(DT, col) {
  DT %>%
    filter(.data[[col]] == "TRUE") %>%
    summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
              "median" = round(median(years), 1),
              "Mean" = round(mean(years),1)) %>%
    unnest_wider(quantile)
}

fShah = function(){
map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
  mutate(Year = cols)
}
fShah()
# # A tibble: 2 x 8
#   Year        `10%` `25%` `50%` `75%` `90%` median  Mean
#   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
# 1 result_2010  13.5  16.4  19.9  23.3  26.4   19.9  19.9
# 2 result_2011  13.4  16.5  20    23.4  26.6   20    20  

library(microbenchmark)
ggplot2::autoplot(microbenchmark(fFiolka(), fjay_sf(), fShah(), times=100))

Hope the chart above explains it all.

@r2evans please don't blame me for skipping your solution but it caused me some errors.

Upvotes: 4

r2evans

Reputation: 160437

A melt/dcast option:

library(data.table)
tmp <- melt(DT1, c("ID", "years"), variable.name = "year"
  )[ value == "TRUE",
   ][, .(variable = c(paste0("q", c(10, 25, 50, 75, 90)), "mu"),
         value = c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)), 
                  mean(years)))
    , by = .(year)]
tmp
#            year variable  value
#          <fctr>   <char>  <num>
#  1: result_2010      q10  4.710
#  2: result_2010      q25 11.325
#  3: result_2010      q50 17.000
#  4: result_2010      q75 22.000
#  5: result_2010      q90 27.400
#  6: result_2010       mu 16.325
#  7: result_2011      q10 12.000
#  8: result_2011      q25 12.000
#  9: result_2011      q50 12.000
# 10: result_2011      q75 14.250
# 11: result_2011      q90 15.600
# 12: result_2011       mu 13.500

dcast(tmp, year ~ variable, value.var = "value")
#           year     mu   q10    q25   q50   q75   q90
#         <fctr>  <num> <num>  <num> <num> <num> <num>
# 1: result_2010 16.325  4.71 11.325    17 22.00  27.4
# 2: result_2011 13.500 12.00 12.000    12 14.25  15.6

You have complete control over the names, just assign then (in order) within the "variable" column (you might choose to name it better).

Or a solitary melt:

melt(DT1, c("ID", "years"), variable.name = "year"
  )[ value == "TRUE",
   ][, setNames(as.list(c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)), 
                          mean(years))),
                c(paste0("q", c(10, 25, 50, 75, 90)), "mu"))
    , by = .(year)][]
#           year   q10    q25   q50   q75   q90     mu
#         <fctr> <num>  <num> <num> <num> <num>  <num>
# 1: result_2010  4.71 11.325    17 22.00  27.4 16.325
# 2: result_2011 12.00 12.000    12 14.25  15.6 13.500

Names are again controlled easily, now in the 2nd argument of setNames. The premise is that returning a named-list in data.table processing will convert it to named columns, so any function that does this is easily usable.

Upvotes: 2

jay.sf

Reputation: 72803

Using melt and aggregate.

library(data.table)
melt(DT1, c(1, 4), 2:3) |>
  transform(variable=substring(variable, 8)) |>
  subset(value == TRUE) |>
  with(aggregate(list(q=years), list(year=variable), \(x)
                 c(quantile(x), mean=mean(x))))
#   year   q.0%  q.25%  q.50%  q.75% q.100% q.mean
# 1 2010  0.300 11.325 17.000 22.000 31.000 16.325
# 2 2011 12.000 12.000 12.000 14.250 16.500 13.500

Note: Please use R>=4.1 for the |> pipes and \(x) function shorthand notation (or write function(x)).

Upvotes: 4

Ronak Shah

Reputation: 388982

You may write-up a function and run it on every result column.

library(tidyverse)

cols <- grep('result_', names(DT1), value = TRUE)

get_stats_fun <- function(DT, col) {
  DT %>%
    filter(.data[[col]] == "TRUE") %>%
    summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
              "median" = round(median(years), 1),
              "Mean" = round(mean(years),1)) %>%
    unnest_wider(quantile)
}

map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
  mutate(Year = cols)

#  Year        `10%` `25%` `50%` `75%` `90%` median  Mean
#  <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
#1 result_2010   4.7  11.3    17  22    27.4     17  16.3
#2 result_2011  12    12      12  14.2  15.6     12  13.5

Upvotes: 2

How to calculate percentile with group by?

Answers (5)

Related Questions