Reputation: 165
I have a data.table with over ten thousand of rows and it looks like this:
DT1 <- data.table(ID = 1:10,
result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))
ID result_2010 result_2011 years
1: 1 TRUE FALSE 15.0
2: 2 FALSE TRUE 16.5
3: 3 TRUE FALSE 31.0
4: 4 FALSE FALSE 1.0
5: 5 FALSE FALSE 40.2
6: 6 TRUE FALSE 0.3
7: 7 FALSE TRUE 12.0
8: 8 FALSE FALSE 22.7
9: 9 TRUE FALSE 19.0
10: 10 FALSE TRUE 12.0
For "result_2010" and "result_2011" I want to make a percentile analysis of the "years" but only if the value in for the individual is "TRUE". The code that I tried seems to work, but it gives back the same results for "result_2010" and "result_2011", which is for sure incorrect:
DT1 %>%
group_by(result_2010 == "TRUE") %>%
summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
"25.quantile"= round(quantile(years,c(.25)),digits=1),
"Median"= round(quantile(years,c(.50)),digits=1),
"75.quantile"= round(quantile(years,c(.75)),digits=1),
"90.quantile"= round(quantile(years,c(.90)),digits=1),
"Mean" = round(mean(years),digits=1))
DT1 %>%
group_by(result_2011 == "TRUE") %>%
summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
"25.quantile"= round(quantile(years,c(.25)),digits=1),
"Median"= round(quantile(years,c(.50)),digits=1),
"75.quantile"= round(quantile(years,c(.75)),digits=1),
"90.quantile"= round(quantile(years,c(.90)),digits=1),
"Mean" = round(mean(years),digits=1))
Could anyone help how to correct my code?
Upvotes: 5
Views: 4218
Reputation: 489
This will be my first answer, so please forgive me if I do something wrong. By reading your question carefully, you wanted someone to help you improve your code. Here it is, please.
library(tidyverse)
library(data.table)
DT1 <- data.table(ID = 1:10,
result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))
DT1 %>%
filter(result_2010 == "TRUE") %>%
summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
"25.quantile"= round(quantile(years,c(.25)),digits=1),
"Median"= round(quantile(years,c(.50)),digits=1),
"75.quantile"= round(quantile(years,c(.75)),digits=1),
"90.quantile"= round(quantile(years,c(.90)),digits=1),
"Mean" = round(mean(years),digits=1))
DT1 %>%
filter(result_2011 == "TRUE") %>%
summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
"25.quantile"= round(quantile(years,c(.25)),digits=1),
"Median"= round(quantile(years,c(.50)),digits=1),
"75.quantile"= round(quantile(years,c(.75)),digits=1),
"90.quantile"= round(quantile(years,c(.90)),digits=1),
"Mean" = round(mean(years),digits=1))
In the first case, it returns the values 4.7, 11.3, 17, 22, 27.4, 16.3. In the second case, it returns 12, 12, 12, 14.2, 15.6, 13.5. I see so many different answers here. Although I honestly admit some of them I don't understand (yet). I really like the solution with quantile%>% tibble%>% bind_cols. But knock on what I have a low reputation for pointing to this as helpful.
Upvotes: 1
Reputation: 4949
library(tidyverse)
DT1 <- tibble(ID = 1:10,
result_2010 = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
result_2011 = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))
fQuantMean = function(x) t(quantile(x)) %>%
as_tibble() %>% bind_cols(mean = mean(x))
tibble(
year = c(2010, 2011),
data = list(DT1$years[DT1$result_2010],
DT1$years[DT1$result_2011])
) %>% group_by(year) %>%
group_modify(~fQuantMean(.x$data[[1]]))
output
# A tibble: 2 x 7
# Groups: year [2]
year `0%` `25%` `50%` `75%` `100%` mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2010 0.3 11.3 17 22 31 16.3
2 2011 12 12 12 14.2 16.5 13.5
An update for anyone interested!
Hello dear colleagues. As you can see, each task can be solved in several different ways. So I decided to compare the methods proposed by here. Since @Gabesz mentioned that he has 10000 observations, I decided to check each of the solutions in terms of performance.
n=10000
set.seed(1234)
DT1 <- tibble(ID = 1:n,
result_2010 = sample(c(TRUE, FALSE), n, replace = TRUE),
result_2011 = sample(c(TRUE, FALSE), n, replace = TRUE),
years = rnorm(n, 20, 5))
Then I did a little benchmark
fQuantMean = function(x) t(quantile(x)) %>%
as_tibble() %>% bind_cols(mean = mean(x))
fFiolka = function(){
tibble(
year = c(2010, 2011),
data = list(DT1$years[DT1$result_2010],
DT1$years[DT1$result_2011])
) %>% group_by(year) %>%
group_modify(~fQuantMean(.x$data[[1]]))
}
fFiolka()
# # A tibble: 2 x 7
# # Groups: year [2]
# year `0%` `25%` `50%` `75%` `100%` mean
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2010 -0.00697 16.4 19.9 23.3 38.6 19.9
# 2 2011 -0.633 16.5 20.0 23.4 38.6 20.0
library(data.table)
fjay_sf = function(){
melt(DT1, c(1, 4), 2:3) |>
transform(variable=substring(variable, 8)) |>
subset(value == TRUE) |>
with(aggregate(list(q=years), list(year=variable), \(x)
c(quantile(x), mean=mean(x))))
}
fjay_sf()
# year q.0% q.25% q.50% q.75% q.100% q.mean
# 1 2010 -0.006968224 16.447077579 19.947385976 23.348571278 38.636456902 19.944574420
# 2 2011 -0.633138113 16.530534403 20.043636844 23.424378551 38.636456902 20.013130400
# Warning message:
# In melt(DT1, c(1, 4), 2:3) :
# The melt generic in data.table has been passed a tbl_df and will attempt to redirect
# to the relevant reshape2 method; please note that reshape2 is deprecated, and this
# redirection is now deprecated as well. To continue using melt methods from reshape2
# while both libraries are attached, e.g. melt.list, you can prepend the namespace
# like reshape2::melt(DT1). In the next version, this warning will become an error.
cols <- grep('result_', names(DT1), value = TRUE)
get_stats_fun <- function(DT, col) {
DT %>%
filter(.data[[col]] == "TRUE") %>%
summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
"median" = round(median(years), 1),
"Mean" = round(mean(years),1)) %>%
unnest_wider(quantile)
}
fShah = function(){
map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
mutate(Year = cols)
}
fShah()
# # A tibble: 2 x 8
# Year `10%` `25%` `50%` `75%` `90%` median Mean
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 result_2010 13.5 16.4 19.9 23.3 26.4 19.9 19.9
# 2 result_2011 13.4 16.5 20 23.4 26.6 20 20
library(microbenchmark)
ggplot2::autoplot(microbenchmark(fFiolka(), fjay_sf(), fShah(), times=100))
Hope the chart above explains it all.
@r2evans please don't blame me for skipping your solution but it caused me some errors.
Upvotes: 4
Reputation: 160437
A melt
/dcast
option:
library(data.table)
tmp <- melt(DT1, c("ID", "years"), variable.name = "year"
)[ value == "TRUE",
][, .(variable = c(paste0("q", c(10, 25, 50, 75, 90)), "mu"),
value = c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)),
mean(years)))
, by = .(year)]
tmp
# year variable value
# <fctr> <char> <num>
# 1: result_2010 q10 4.710
# 2: result_2010 q25 11.325
# 3: result_2010 q50 17.000
# 4: result_2010 q75 22.000
# 5: result_2010 q90 27.400
# 6: result_2010 mu 16.325
# 7: result_2011 q10 12.000
# 8: result_2011 q25 12.000
# 9: result_2011 q50 12.000
# 10: result_2011 q75 14.250
# 11: result_2011 q90 15.600
# 12: result_2011 mu 13.500
dcast(tmp, year ~ variable, value.var = "value")
# year mu q10 q25 q50 q75 q90
# <fctr> <num> <num> <num> <num> <num> <num>
# 1: result_2010 16.325 4.71 11.325 17 22.00 27.4
# 2: result_2011 13.500 12.00 12.000 12 14.25 15.6
You have complete control over the names, just assign then (in order) within the "variable"
column (you might choose to name it better).
Or a solitary melt
:
melt(DT1, c("ID", "years"), variable.name = "year"
)[ value == "TRUE",
][, setNames(as.list(c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)),
mean(years))),
c(paste0("q", c(10, 25, 50, 75, 90)), "mu"))
, by = .(year)][]
# year q10 q25 q50 q75 q90 mu
# <fctr> <num> <num> <num> <num> <num> <num>
# 1: result_2010 4.71 11.325 17 22.00 27.4 16.325
# 2: result_2011 12.00 12.000 12 14.25 15.6 13.500
Names are again controlled easily, now in the 2nd argument of setNames
. The premise is that returning a named-list
in data.table
processing will convert it to named columns, so any function that does this is easily usable.
Upvotes: 2
Reputation: 72803
Using melt
and aggregate
.
library(data.table)
melt(DT1, c(1, 4), 2:3) |>
transform(variable=substring(variable, 8)) |>
subset(value == TRUE) |>
with(aggregate(list(q=years), list(year=variable), \(x)
c(quantile(x), mean=mean(x))))
# year q.0% q.25% q.50% q.75% q.100% q.mean
# 1 2010 0.300 11.325 17.000 22.000 31.000 16.325
# 2 2011 12.000 12.000 12.000 14.250 16.500 13.500
Note: Please use R>=4.1 for the |>
pipes and \(x)
function shorthand notation (or write function(x)
).
Upvotes: 4
Reputation: 388982
You may write-up a function and run it on every result
column.
library(tidyverse)
cols <- grep('result_', names(DT1), value = TRUE)
get_stats_fun <- function(DT, col) {
DT %>%
filter(.data[[col]] == "TRUE") %>%
summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
"median" = round(median(years), 1),
"Mean" = round(mean(years),1)) %>%
unnest_wider(quantile)
}
map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
mutate(Year = cols)
# Year `10%` `25%` `50%` `75%` `90%` median Mean
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 result_2010 4.7 11.3 17 22 27.4 17 16.3
#2 result_2011 12 12 12 14.2 15.6 12 13.5
Upvotes: 2