Reputation: 20463
Consider the following interactive example that generates a summary table:
library(dplyr)
tg <- ToothGrowth
ci_int <- 0.95
tg %>%
  group_by(supp, dose) %>%
  summarise(N = n(),
            mean = mean(len, na.rm = TRUE),
            sd = sd(len, na.rm = TRUE),
            se = sd / sqrt(N),
            ci = se * qt(ci_int / 2 + 0.50, N - 1))
#     supp  dose     N  mean       sd        se       ci
#   (fctr) (dbl) (int) (dbl)    (dbl)     (dbl)    (dbl)
# 1     OJ   0.5    10 13.23 4.459709 1.4102837 3.190283
# 2     OJ   1.0    10 22.70 3.910953 1.2367520 2.797727
# 3     OJ   2.0    10 26.06 2.655058 0.8396031 1.899314
# 4     VC   0.5    10  7.98 2.746634 0.8685620 1.964824
# 5     VC   1.0    10 16.77 2.515309 0.7954104 1.799343
# 6     VC   2.0    10 26.14 4.797731 1.5171757 3.432090
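As a sanity check on the se and ci formulas, the first row of the table (supp OJ, dose 0.5) can be recomputed with base R alone. This is only a sketch; the variable names x, N, se and ci are picked here for illustration, and ci is the half-width of a two-sided t interval at the 0.95 level:

```r
# Recompute one group of ToothGrowth by hand (base R only, no dplyr).
x  <- with(ToothGrowth, len[supp == "OJ" & dose == 0.5])
N  <- length(x)                          # group size
se <- sd(x) / sqrt(N)                    # standard error of the mean
ci <- se * qt(0.95 / 2 + 0.50, N - 1)    # t quantile at 0.975 with N - 1 df
c(N = N, mean = mean(x), se = se, ci = ci)
```

The values should agree with row 1 of the summary table above.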
I would like to convert this to a function and abstract away the data.frame, the measure variable, the groupvars grouping variables, and the conf.int. Here's a start:
library(lazyeval)
summarySE <- function(df, measure, groupvars, conf.int = 0.95) {
  summary_dots <- list(
    ~ n(),
    interp(~ mean(var, na.rm = TRUE), var = as.name(measure)),
    interp(~ sd(var, na.rm = TRUE), var = as.name(measure))
  )
  df %>%
    group_by_(.dots = groupvars) %>%
    summarise_(.dots = setNames(summary_dots, c("N", "mean", "sd")))
}
summarySE(tg, "len", c("supp", "dose"))
Which yields:
#     supp  dose     N  mean       sd
#   (fctr) (dbl) (int) (dbl)    (dbl)
# 1     OJ   0.5    10 13.23 4.459709
# 2     OJ   1.0    10 22.70 3.910953
# 3     OJ   2.0    10 26.06 2.655058
# 4     VC   0.5    10  7.98 2.746634
# 5     VC   1.0    10 16.77 2.515309
# 6     VC   2.0    10 26.14 4.797731
However, this doesn't feel very DRY. In addition, I'm not sure how to implement se and ci without getting overly complex/verbose. Perhaps there's a better approach altogether, or perhaps this should be split up into several functions?
How can I convert the summary table above to a function so that I can pass it any combination of a data.frame with different measure and groupvars, in the "spirit" of dplyr?
Upvotes: 4
Views: 313
Reputation: 35307
I don't quite see why calculating the SE and CI would be more complicated than what you were already doing.
I used the ... argument to capture your grouping variables, as that seems a bit easier to use.
Overall I end up with the following function:
summarySE <- function(.data, measure, ..., conf.int = 0.95) {
  dots <- lazyeval::lazy_dots(...)
  measure <- lazyeval::lazy(measure)
  summary_dots <- list(
    N = ~ n(),
    mean = lazyeval::interp(~ mean(var, na.rm = TRUE), var = measure),
    sd = lazyeval::interp(~ sd(var, na.rm = TRUE), var = measure),
    se = ~ sd / sqrt(N),
    ci = ~ se * qt(conf.int / 2 + 0.50, N - 1))
  .data <- dplyr::group_by_(.data, .dots = dots)
  dplyr::summarise_(.data, .dots = summary_dots)
}
You could split this into a standard-evaluation (SE) and a non-standard-evaluation (NSE) version if you'd like (as Hadley might do).
Usage:
summarySE(tg, len, supp, dose)
Source: local data frame [6 x 7]
Groups: supp [?]
    supp  dose     N  mean       sd        se       ci
  (fctr) (dbl) (int) (dbl)    (dbl)     (dbl)    (dbl)
1     OJ   0.5    10 13.23 4.459709 1.4102837 3.190283
2     OJ   1.0    10 22.70 3.910953 1.2367520 2.797727
3     OJ   2.0    10 26.06 2.655058 0.8396031 1.899314
4     VC   0.5    10  7.98 2.746634 0.8685620 1.964824
5     VC   1.0    10 16.77 2.515309 0.7954104 1.799343
6     VC   2.0    10 26.14 4.797731 1.5171757 3.432090
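A note for readers on newer dplyr: group_by_() and summarise_() have since been deprecated in favour of tidy evaluation. Assuming dplyr 1.0.0 or later, the same bare-name interface could be sketched with the embrace operator {{ }} as below. The name summarySE_tidy is hypothetical, chosen here only to keep it apart from the function above, and this is an illustration rather than a drop-in replacement:

```r
library(dplyr)

# Sketch only: assumes dplyr >= 1.0.0. `summarySE_tidy` is a made-up name.
summarySE_tidy <- function(.data, measure, ..., conf.int = 0.95) {
  .data %>%
    group_by(...) %>%                              # bare grouping columns
    summarise(
      N    = n(),
      mean = mean({{ measure }}, na.rm = TRUE),    # {{ }} forwards the bare column name
      sd   = sd({{ measure }}, na.rm = TRUE),
      se   = sd / sqrt(N),                         # uses the sd and N columns just created
      ci   = se * qt(conf.int / 2 + 0.50, N - 1),
      .groups = "drop"                             # return an ungrouped result
    )
}

res <- summarySE_tidy(ToothGrowth, len, supp, dose)
res
```

It is called the same way as the function above, with bare column names for the measure and the grouping variables.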
Upvotes: 4
Reputation: 21425
I'm not sure this is more in the "spirit" of dplyr, but you could also build the expressions for mean, sd, etc. as strings:
summarySE <- function(df, measure, groupvars, conf.int = 0.95) {
  df %>%
    group_by_(.dots = groupvars) %>%
    summarise_(N = "n()",
               mean = paste0("mean(", measure, ", na.rm = TRUE)"),
               sd = paste0("sd(", measure, ", na.rm = TRUE)"),
               se = "sd/sqrt(N)",
               ci = paste0("se * stats::qt(", conf.int, " / 2 + 0.50, N - 1)"))
}
summarySE(tg, "len", c("supp", "dose"))
#    supp  dose     N  mean       sd        se       ci
#  (fctr) (dbl) (int) (dbl)    (dbl)     (dbl)    (dbl)
#1     OJ   0.5    10 13.23 4.459709 1.4102837 3.190283
#2     OJ   1.0    10 22.70 3.910953 1.2367520 2.797727
#3     OJ   2.0    10 26.06 2.655058 0.8396031 1.899314
#4     VC   0.5    10  7.98 2.746634 0.8685620 1.964824
#5     VC   1.0    10 16.77 2.515309 0.7954104 1.799343
#6     VC   2.0    10 26.14 4.797731 1.5171757 3.432090
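If you are on a newer dplyr (1.0.0 or later is assumed here), the same string-based interface can be sketched without pasting code together, by using the .data pronoun for the measure and across(all_of()) for the grouping columns. The name summarySE_chr is hypothetical and only used for this illustration:

```r
library(dplyr)

# Sketch only: assumes dplyr >= 1.0.0. `summarySE_chr` is a made-up name.
summarySE_chr <- function(df, measure, groupvars, conf.int = 0.95) {
  df %>%
    group_by(across(all_of(groupvars))) %>%          # character vector of grouping columns
    summarise(
      N    = n(),
      mean = mean(.data[[measure]], na.rm = TRUE),   # look the measure up by its name
      sd   = sd(.data[[measure]], na.rm = TRUE),
      se   = sd / sqrt(N),
      ci   = se * qt(conf.int / 2 + 0.50, N - 1),
      .groups = "drop"                               # return an ungrouped result
    )
}

res_chr <- summarySE_chr(ToothGrowth, "len", c("supp", "dose"))
res_chr
```

Because no code is built from strings, conf.int is picked up as an ordinary function argument instead of being pasted into an expression.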
Upvotes: 1