Reputation: 954
From the variables pointA, pointB, pointC run in a linear model and anova I would like to collect the p-values in a data frame. In my real data set there are around 30 variables, so I'm looking for a function running through these and collecting the p-values in a data frame. A more efficient way than running through each and manually putting them together in a data frame.
set.seed(1)
id <- rep(1:3,each=4)
trt <- rep(c("A","OA", "B", "OB"),3)
pointA <- sample(1:10,12, replace=TRUE)
pointB<- sample(1:10,12, replace=TRUE)
pointC<- sample(1:10,12, replace=TRUE)
df <- data.frame(id,trt,pointA, pointB,pointC)
df
id trt pointA pointB pointC
1 1 A 8 8 10
2 1 OA 2 7 3
3 1 B 8 5 5
4 1 OB 5 9 4
5 2 A 9 5 7
6 2 OA 7 3 3
7 2 B 8 1 5
8 2 OB 6 1 8
9 3 A 6 4 1
10 3 OA 8 6 9
11 3 B 1 7 4
12 3 OB 5 5 9
data <- function(i){
lmdf <- lm(df[,i]~trt, data=df)
anv <- anova(lmdf)
pvalue <- anv$`Pr(>F)`
return(pvalue)
}
data(5)
I would like it to look something like:
variable pvalue
1 pointA 0.714
2 pointB 0.949
3 pointC 0.080
Upvotes: 2
Views: 94
Reputation: 46938
You can try using the broom package, because this has the flexibility for more than 1 term in anova, and also if you need other stats.
We first pivot longer so that your point to test in now a column variable:
library(dplyr)
library(tidyr)
library(broom)
df %>% pivot_longer(-c(id,trt))
# A tibble: 36 x 4
id trt name value
<int> <fct> <chr> <int>
1 1 A pointA 9
2 1 A pointB 6
3 1 A pointC 9
4 1 OA pointA 4
5 1 OA pointB 10
6 1 OA pointC 1
7 1 B pointA 7
8 1 B pointB 7
9 1 B pointC 4
10 1 OB pointA 1
From this point on, it is grouping by the column name, doing an anova with the same formula, and collecting the stats (using tidy
from broom
) :
res = df %>% pivot_longer(-c(id,trt)) %>%
group_by(name) %>% do(tidy(anova(lm(value ~ trt,data=.))))
# A tibble: 6 x 7
# Groups: name [3]
name term df sumsq meansq statistic p.value
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 pointA trt 3 2.67 0.889 0.0711 0.974
2 pointA Residuals 8 100. 12.5 NA NA
3 pointB trt 3 27.7 9.22 1.68 0.248
4 pointB Residuals 8 44.0 5.50 NA NA
5 pointC trt 3 14.0 4.67 0.386 0.766
6 pointC Residuals 8 96.7 12.1 NA NA
You just need to filter your results from trt:
res %>% filter(term=="trt")
# A tibble: 3 x 7
# Groups: name [3]
name term df sumsq meansq statistic p.value
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 pointA trt 3 2.67 0.889 0.0711 0.974
2 pointB trt 3 27.7 9.22 1.68 0.248
3 pointC trt 3 14.0 4.67 0.386 0.766
You can do it in baseR:
library(broom)
res = lapply(colnames(df)[-c(1:2)],function(i){
this_formula = as.formula(paste(i,"~ trt"))
anova_fit = anova(lm(this_formula, data=df))
data.frame(name=i,statistic=anova_fit[1,4],p.value=anova_fit[1,5])
})
res = do.call(rbind,res)
name statistic p.value
1 pointA 0.07111111 0.9737895
2 pointB 1.67676768 0.2482931
3 pointC 0.38620690 0.7660808
Upvotes: 2
Reputation: 5336
You can treat the columns of a dataframe as a list, and use sapply
to iterate over the columns you want to use as outcomes.
Getting the p-value out of each ANOVA is a bit fiddly, maybe somebody knows a better way, but this works:
pvalues <- sapply( df[,3:5] , FUN = function(x)
summary(aov(x~df$trt))[[1]]$`Pr(>F)`[1]
)
data.frame(pvalues)
pvalues
pointA 0.9737895
pointB 0.2482931
pointC 0.7660808
I did think about using a multivariate regression, but it was no easier to get the p-values for each outcome variable that way.
Upvotes: 2