user11916948
user11916948

Reputation: 954

Function for running a linear model and anova for several variables and collect the p-values in a data frame

From the variables pointA, pointB, pointC run in a linear model and anova I would like to collect the p-values in a data frame. In my real data set there are around 30 variables, so I'm looking for a function running through these and collecting the p-values in a data frame. A more efficient way than running through each and manually putting them together in a data frame.

set.seed(1)
id <- rep(1:3,each=4)
trt <- rep(c("A","OA", "B", "OB"),3)
pointA <- sample(1:10,12, replace=TRUE)
pointB<- sample(1:10,12, replace=TRUE)
pointC<- sample(1:10,12, replace=TRUE)
df <- data.frame(id,trt,pointA, pointB,pointC)
df

id trt pointA pointB pointC
1   1   A      8      8     10
2   1  OA      2      7      3
3   1   B      8      5      5
4   1  OB      5      9      4
5   2   A      9      5      7
6   2  OA      7      3      3
7   2   B      8      1      5
8   2  OB      6      1      8
9   3   A      6      4      1
10  3  OA      8      6      9
11  3   B      1      7      4
12  3  OB      5      5      9


data <- function(i){
  lmdf <- lm(df[,i]~trt, data=df)
  anv <- anova(lmdf)
  pvalue <- anv$`Pr(>F)`
  return(pvalue)
  }
data(5)

I would like it to look something like:

 variable pvalue
1   pointA  0.714
2   pointB  0.949
3   pointC  0.080

Upvotes: 2

Views: 94

Answers (2)

StupidWolf
StupidWolf

Reputation: 46938

You can try using the broom package, because this has the flexibility for more than 1 term in anova, and also if you need other stats.

We first pivot longer so that your point to test in now a column variable:

library(dplyr)
library(tidyr)
library(broom)

df %>% pivot_longer(-c(id,trt)) 
# A tibble: 36 x 4
      id trt   name   value
   <int> <fct> <chr>  <int>
 1     1 A     pointA     9
 2     1 A     pointB     6
 3     1 A     pointC     9
 4     1 OA    pointA     4
 5     1 OA    pointB    10
 6     1 OA    pointC     1
 7     1 B     pointA     7
 8     1 B     pointB     7
 9     1 B     pointC     4
10     1 OB    pointA     1

From this point on, it is grouping by the column name, doing an anova with the same formula, and collecting the stats (using tidy from broom) :

res = df %>% pivot_longer(-c(id,trt)) %>% 
group_by(name) %>% do(tidy(anova(lm(value ~ trt,data=.)))) 
# A tibble: 6 x 7
# Groups:   name [3]
  name   term         df  sumsq meansq statistic p.value
  <chr>  <chr>     <int>  <dbl>  <dbl>     <dbl>   <dbl>
1 pointA trt           3   2.67  0.889    0.0711   0.974
2 pointA Residuals     8 100.   12.5     NA       NA    
3 pointB trt           3  27.7   9.22     1.68     0.248
4 pointB Residuals     8  44.0   5.50    NA       NA    
5 pointC trt           3  14.0   4.67     0.386    0.766
6 pointC Residuals     8  96.7  12.1     NA       NA   

You just need to filter your results from trt:

res %>% filter(term=="trt")
# A tibble: 3 x 7
# Groups:   name [3]
  name   term     df sumsq meansq statistic p.value
  <chr>  <chr> <int> <dbl>  <dbl>     <dbl>   <dbl>
1 pointA trt       3  2.67  0.889    0.0711   0.974
2 pointB trt       3 27.7   9.22     1.68     0.248
3 pointC trt       3 14.0   4.67     0.386    0.766

You can do it in baseR:

library(broom)
res = lapply(colnames(df)[-c(1:2)],function(i){
this_formula = as.formula(paste(i,"~ trt"))
anova_fit = anova(lm(this_formula, data=df))
data.frame(name=i,statistic=anova_fit[1,4],p.value=anova_fit[1,5])
})
res = do.call(rbind,res)

    name     statistic   p.value
1 pointA 0.07111111 0.9737895
2 pointB 1.67676768 0.2482931
3 pointC 0.38620690 0.7660808

Upvotes: 2

George Savva
George Savva

Reputation: 5336

You can treat the columns of a dataframe as a list, and use sapply to iterate over the columns you want to use as outcomes.

Getting the p-value out of each ANOVA is a bit fiddly, maybe somebody knows a better way, but this works:

pvalues <- sapply( df[,3:5] , FUN = function(x) 
  summary(aov(x~df$trt))[[1]]$`Pr(>F)`[1]  
  )
data.frame(pvalues)

         pvalues
pointA 0.9737895
pointB 0.2482931
pointC 0.7660808

I did think about using a multivariate regression, but it was no easier to get the p-values for each outcome variable that way.

Upvotes: 2

Related Questions