Reputation: 41
A common task is to have to carry out a certain statistical analysis (like an anova, glm or mixed model) on different subsets of a dataset and combine the output tables with summary coefficients and p values in a single dataframe. I am looking though for a generic function that would take the type of model (e.g. aov(...)
or lm(...)
or glm(...)
or glmer(...)
) and the particular output terms for which the coefficients and p values would have to be returned for each of the replicate analyses according to some grouping variable(s) in one's dataset.
Say if I have a dataframe in which I would like to carry out a certain analysis over the different levels of factor "replicate" in a dataframe data
:
data(iris)
library(car)
data=data.frame()
for (i in 1:10) {data=rbind(data,cbind(replicate=i,iris))}
Using broom+dplyr
, I could e.g. do an anova over each subsets of this dataframe (grouping by replicate) and keep the p values for term "Species" using
library(devtools)
install_github("dgrtwo/broom")
library(broom)
library(dplyr)
group_by(data, replicate) %>% do(tidy(Anova(aov(Sepal.Length ~ Species, data = .),type="III"))) %>% filter(term=="Species")
Source: local data frame [10 x 6]
Groups: replicate [10]
replicate term sumsq df statistic p.value
(int) (chr) (dbl) (dbl) (dbl) (dbl)
1 1 Species 189.6364 2 362.6614 2.580311e-94
2 2 Species 189.6364 2 362.6614 2.580311e-94
3 3 Species 189.6364 2 362.6614 2.580311e-94
4 4 Species 189.6364 2 362.6614 2.580311e-94
5 5 Species 189.6364 2 362.6614 2.580311e-94
6 6 Species 189.6364 2 362.6614 2.580311e-94
7 7 Species 189.6364 2 362.6614 2.580311e-94
8 8 Species 189.6364 2 362.6614 2.580311e-94
9 9 Species 189.6364 2 362.6614 2.580311e-94
10 10 Species 189.6364 2 362.6614 2.580311e-94
(I used 10 identical data subsets just as an example here)
I am looking though for a more generic function "Anovabygroup
", which would take the dataframe, the grouping variable(s) (here replicate
, but it could also be the combination of several grouping variables), the type of model to run (e.g. in this case 'aov(Sepal.Length ~ Species, data = .)'
, but it could also be a lm, glm, lme, lmer or glmer model or any other model handled by Anova()
) and the factors to return coefficients and p values for (maybe with option "all" to return everything) as arguments (any other options given could be passed on to the call to Anova). Would anyone know how to do this by any chance, using code similar to that used above, but generalised to take these arguments? Main thing I don't know how to do is to pass on the model (e.g. in this case `'aov(Sepal.Length ~ Species, data = .)') as an argument and have it evaluated. Or does it perhaps already exist in some package? I think this could be useful as I always find myself coding this task over and over again...
PS I used github version of the broom package as the current CRAN version doesn't seem to handle Anova output well
Upvotes: 4
Views: 5802
Reputation: 1509
You can solve this by creating a wrapper function that parses text inputs.
Here, I use parse
and eval
.
Ideally the function would also check the validity of the expression (glm
,lm
, etc) that is passed, but here is an example function to get you started.
It also lets you pass additional options to Anova
as you requested:
anova_wrapper <- function(data, model_expression_as_string, grouping_variable,...) {
f_wrap <- paste0('function(.) {',model_expression_as_string,'}') %>% parse(text=.) %>% eval
data %>% group_by_(grouping_variable) %>%
do(f_wrap(.) %>% Anova(...=...) %>% tidy) %>% return
}
Using your example code (thank you for posting good code!):
data=data.frame()
for (i in 1:10) {data=rbind(data,cbind(replicate=i,iris))}
aov_model_expression_as_string = 'aov(Sepal.Length ~ Species, data = .)'
lm_model_expression_as_string = 'lm(Sepal.Length ~ Sepal.Width + Petal.Length , data = .)'
grouping_variable = 'replicate'
data %>%
anova_wrapper(model_expression_as_string = aov_model_expression_as_string,
grouping_variable = grouping_variable,type="III")
Source: local data frame [30 x 6]
Groups: replicate [10]
replicate term sumsq df statistic p.value
(int) (chr) (dbl) (dbl) (dbl) (dbl)
1 1 (Intercept) 1253.00180 1 4728.1630 1.134286e-113
2 1 Species 63.21213 2 119.2645 1.669669e-31
3 1 Residuals 38.95620 147 NA NA
4 2 (Intercept) 1253.00180 1 4728.1630 1.134286e-113
5 2 Species 63.21213 2 119.2645 1.669669e-31
6 2 Residuals 38.95620 147 NA NA
7 3 (Intercept) 1253.00180 1 4728.1630 1.134286e-113
8 3 Species 63.21213 2 119.2645 1.669669e-31
9 3 Residuals 38.95620 147 NA NA
10 4 (Intercept) 1253.00180 1 4728.1630 1.134286e-113
.. ... ... ... ... ... ...
And using a lm
instead of aov
and different argument for Anova
:
data %>%
anova_wrapper(model_expression_as_string = lm_model_expression_as_string,
grouping_variable = grouping_variable,type="III")
replicate term sumsq df statistic p.value
(int) (chr) (dbl) (dbl) (dbl) (dbl)
1 1 Sepal.Width 8.19627 1 73.78707 1.163254e-14
2 1 Petal.Length 84.42733 1 760.05861 5.847914e-60
3 1 Residuals 16.32876 147 NA NA
4 2 Sepal.Width 8.19627 1 73.78707 1.163254e-14
5 2 Petal.Length 84.42733 1 760.05861 5.847914e-60
6 2 Residuals 16.32876 147 NA NA
7 3 Sepal.Width 8.19627 1 73.78707 1.163254e-14
8 3 Petal.Length 84.42733 1 760.05861 5.847914e-60
9 3 Residuals 16.32876 147 NA NA
10 4 Sepal.Width 8.19627 1 73.78707 1.163254e-14
.. ... ... ... ... ... ...
I regularly these types of regressions/anovas repeatedly in monte carlo simulations, but I normally create a seperate function for each sort of analysis. There may be interest from the R community in a package that does these analyses repeatedly!
Upvotes: 4