user2999576
user2999576

Reputation: 41

R: function to apply Anova over different subsets of one's dataset and collect output

A common task is to have to carry out a certain statistical analysis (like an anova, glm or mixed model) on different subsets of a dataset and combine the output tables with summary coefficients and p values in a single dataframe. I am looking though for a generic function that would take the type of model (e.g. aov(...) or lm(...) or glm(...) or glmer(...) ) and the particular output terms for which the coefficients and p values would have to be returned for each of the replicate analyses according to some grouping variable(s) in one's dataset.

Say if I have a dataframe in which I would like to carry out a certain analysis over the different levels of factor "replicate" in a dataframe data :

data(iris)
library(car)
data=data.frame()
for (i in 1:10) {data=rbind(data,cbind(replicate=i,iris))}

Using broom+dplyr, I could e.g. do an anova over each subsets of this dataframe (grouping by replicate) and keep the p values for term "Species" using

library(devtools)
install_github("dgrtwo/broom")
library(broom)
library(dplyr)

group_by(data, replicate) %>% do(tidy(Anova(aov(Sepal.Length ~ Species, data = .),type="III"))) %>% filter(term=="Species")

Source: local data frame [10 x 6]
Groups: replicate [10]

   replicate    term    sumsq    df statistic      p.value
       (int)   (chr)    (dbl) (dbl)     (dbl)        (dbl)
1          1 Species 189.6364     2  362.6614 2.580311e-94
2          2 Species 189.6364     2  362.6614 2.580311e-94
3          3 Species 189.6364     2  362.6614 2.580311e-94
4          4 Species 189.6364     2  362.6614 2.580311e-94
5          5 Species 189.6364     2  362.6614 2.580311e-94
6          6 Species 189.6364     2  362.6614 2.580311e-94
7          7 Species 189.6364     2  362.6614 2.580311e-94
8          8 Species 189.6364     2  362.6614 2.580311e-94
9          9 Species 189.6364     2  362.6614 2.580311e-94
10        10 Species 189.6364     2  362.6614 2.580311e-94

(I used 10 identical data subsets just as an example here)

I am looking though for a more generic function "Anovabygroup", which would take the dataframe, the grouping variable(s) (here replicate, but it could also be the combination of several grouping variables), the type of model to run (e.g. in this case 'aov(Sepal.Length ~ Species, data = .)', but it could also be a lm, glm, lme, lmer or glmer model or any other model handled by Anova()) and the factors to return coefficients and p values for (maybe with option "all" to return everything) as arguments (any other options given could be passed on to the call to Anova). Would anyone know how to do this by any chance, using code similar to that used above, but generalised to take these arguments? Main thing I don't know how to do is to pass on the model (e.g. in this case `'aov(Sepal.Length ~ Species, data = .)') as an argument and have it evaluated. Or does it perhaps already exist in some package? I think this could be useful as I always find myself coding this task over and over again...

PS I used github version of the broom package as the current CRAN version doesn't seem to handle Anova output well

Upvotes: 4

Views: 5802

Answers (1)

Nelson Auner
Nelson Auner

Reputation: 1509

Answer:

You can solve this by creating a wrapper function that parses text inputs. Here, I use parse and eval.

Ideally the function would also check the validity of the expression (glm,lm, etc) that is passed, but here is an example function to get you started.

It also lets you pass additional options to Anova as you requested:

anova_wrapper <- function(data, model_expression_as_string, grouping_variable,...) {
  f_wrap <- paste0('function(.) {',model_expression_as_string,'}') %>%    parse(text=.) %>% eval
  data %>% group_by_(grouping_variable) %>% 
     do(f_wrap(.) %>% Anova(...=...) %>% tidy) %>% return
}

Example:

Using your example code (thank you for posting good code!):

data=data.frame()
for (i in 1:10) {data=rbind(data,cbind(replicate=i,iris))}

aov_model_expression_as_string = 'aov(Sepal.Length ~ Species, data = .)'
lm_model_expression_as_string = 'lm(Sepal.Length ~ Sepal.Width + Petal.Length , data = .)'
grouping_variable = 'replicate'


data %>% 
    anova_wrapper(model_expression_as_string = aov_model_expression_as_string,
                  grouping_variable = grouping_variable,type="III")



    Source: local data frame [30 x 6]
    Groups: replicate [10]

    replicate        term      sumsq    df statistic       p.value
    (int)       (chr)      (dbl) (dbl)     (dbl)         (dbl)
    1          1 (Intercept) 1253.00180     1 4728.1630 1.134286e-113
    2          1     Species   63.21213     2  119.2645  1.669669e-31
    3          1   Residuals   38.95620   147        NA            NA
    4          2 (Intercept) 1253.00180     1 4728.1630 1.134286e-113
    5          2     Species   63.21213     2  119.2645  1.669669e-31
    6          2   Residuals   38.95620   147        NA            NA
    7          3 (Intercept) 1253.00180     1 4728.1630 1.134286e-113
    8          3     Species   63.21213     2  119.2645  1.669669e-31
    9          3   Residuals   38.95620   147        NA            NA
    10         4 (Intercept) 1253.00180     1 4728.1630 1.134286e-113
    ..       ...         ...        ...   ...       ...           ...

And using a lm instead of aov and different argument for Anova:

data %>% 
    anova_wrapper(model_expression_as_string = lm_model_expression_as_string,
                  grouping_variable = grouping_variable,type="III")



    replicate         term    sumsq    df statistic      p.value
    (int)        (chr)    (dbl) (dbl)     (dbl)        (dbl)
    1          1  Sepal.Width  8.19627     1  73.78707 1.163254e-14
    2          1 Petal.Length 84.42733     1 760.05861 5.847914e-60
    3          1    Residuals 16.32876   147        NA           NA
    4          2  Sepal.Width  8.19627     1  73.78707 1.163254e-14
    5          2 Petal.Length 84.42733     1 760.05861 5.847914e-60
    6          2    Residuals 16.32876   147        NA           NA
    7          3  Sepal.Width  8.19627     1  73.78707 1.163254e-14
    8          3 Petal.Length 84.42733     1 760.05861 5.847914e-60
    9          3    Residuals 16.32876   147        NA           NA
    10         4  Sepal.Width  8.19627     1  73.78707 1.163254e-14
    ..       ...          ...      ...   ...       ...          ...

I regularly these types of regressions/anovas repeatedly in monte carlo simulations, but I normally create a seperate function for each sort of analysis. There may be interest from the R community in a package that does these analyses repeatedly!

Upvotes: 4

Related Questions