Katie Tetzloff

Reputation: 65

lm(): loop through multiple linear models exporting p-value of F-statistic

I have a large data set for which I need to run a linear model comparing groups, and I need the p-values for the group comparisons. There are four groups (so I need 1~2, 1~3, 1~4, 2~3, 2~4, 3~4) and there are 130 columns for which the data from these groups need to be compared. Any help would be greatly appreciated!!

I have this, which gives me exactly what I need.

fit<-lm(variable~group, data=data)
summary(fit)

However, with all of the groups and columns, I have nearly 800 comparisons to make, so I want to avoid doing this manually. I tried writing a for loop, but it isn't working.

k<-data.frame()
for (i in 1:130){
 [i,1]<-colnames(data)
 fit<- lm(i~group, data=data)
 [i,2] <- fit$p.value
}

But this has given me a variety of different errors. I really just need the p-values. Help would be greatly greatly appreciated!! Thank you!

Upvotes: 1

Views: 1453

Answers (2)

Zheyuan Li

Reputation: 73265

(2016-06-18) Your question is not completely answerable at this stage. In the following, I shall point out several problems.


How to get p-value properly

I assume you want the p-value of the model's F-statistic, which tests whether the model fits significantly better than an intercept-only model. If your fitted model is fit, you can compute it like this:

fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))

As an example, I will use the built-in dataset trees as a demonstration.

fit <- lm(Height ~ Girth, trees)
## truncated output of summary(fit)
# > summary(fit)
# Residual standard error: 5.538 on 29 degrees of freedom
# Multiple R-squared:  0.2697,  Adjusted R-squared:  0.2445 
# F-statistic: 10.71 on 1 and 29 DF,  p-value: 0.002758

fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
## > p_value
# [1] 0.002757815

So, p_value agrees with the printed summary.
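As a cross-check, here is a minimal sketch of two other ways to get the same number in base R. It assumes the same trees model as above; the names f, p_upper and p_anova are only illustrative. Passing lower.tail = FALSE to pf() returns the upper tail directly, which avoids the loss of precision that 1 - pf(...) suffers when the p-value is extremely small. For a model with a single predictor, anova() reports the same F-test.

```r
fit <- lm(Height ~ Girth, data = trees)

# summary(fit)$fstatistic is a named vector: value, numdf, dendf
f <- summary(fit)$fstatistic

# Upper-tail probability computed directly instead of 1 - pf(...)
p_upper <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# With one predictor, the anova table's F-test is the overall F-test
p_anova <- anova(fit)[["Pr(>F)"]][1]

all.equal(p_upper, p_anova)
```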


Your loop

I suggest you store the results in vectors rather than in a data frame while the loop is running.

variable <- character(130)
p.value <- numeric(130)

You can combine the results at the end to a data frame via:

k <- data.frame(var = variable, p.value = p.value)

Why? Because updating a preallocated vector is cheaper than updating a data frame row by row. With those corrections, we arrive at:

variable <- character(130)
p.value <- numeric(130)
for (i in 1:130) {
  variable[i] <- colnames(data)
  fit <- lm(i~group, data=data)
  fstatistic <- summary(fit)$fstatistic
  p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
  p.value[i] <- p_value
  }
k <- data.frame(var = variable, p.value = p.value)

Further problems

I still don't think the code above will work, because I am not sure the following two lines do what you intend:

  variable[i] <- colnames(data)
  fit <- lm(i~group, data=data)
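  1. During the loop, data is not changed, so colnames(data) returns the whole vector of column names every time. Hence variable[i] <- colnames(data) tries to put all the names into a single element, which triggers a warning and keeps only the first name.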
  2. i~group looks odd. Do you have i in your data?
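Both points can be reproduced on a small made-up data frame. This is only a sketch: toy, msg1 and msg2 are hypothetical names, and the data are simulated rather than taken from the question. tryCatch() is used to capture the warning and the error as text instead of stopping the script.

```r
# Hypothetical toy data: one grouping column plus two measurement columns
set.seed(1)
toy <- data.frame(group = factor(rep(1:2, each = 5)),
                  a = rnorm(10),
                  b = rnorm(10))

variable <- character(2)
i <- 1

# Point 1: every column name is pushed into a single slot of the vector
msg1 <- tryCatch({
  variable[i] <- colnames(toy)
  "no warning"
}, warning = function(w) conditionMessage(w))

# Point 2: `i` in the formula is the loop counter (a single number),
# not a column of `toy`, so its length does not match `group`
msg2 <- tryCatch({
  lm(i ~ group, data = toy)
  "no error"
}, error = function(e) conditionMessage(e))

msg1
msg2
```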

I can't solve these issues for you yet, because I have no idea what your data looks like. If you could post a subset of your data, I could.


Follow-up (2016-06-19)

Thank you. This has been extremely helpful. I don't have "i" in my data, but I was hoping that I could use that to represent the different column names, so that it goes through all of them. Is there a way to assign column names numbers so that this would work?

Yes, but I need to know what you have for each column.

Column 1 has a group number. The following columns have data for different factors I am looking at.

OK, so I think ncol(data) = 131, where the first column is group, and the remaining 130 columns are what you will test. Then this should work:

variable <- colnames(data)[-1]
p.value <- numeric(length(variable))
for (i in seq_along(variable)) {
  # Build the formula "<column> ~ group" from the i-th column name
  fit <- lm(as.formula(paste(variable[i], "group", sep = "~")), data = data)
  fstatistic <- summary(fit)$fstatistic
  p.value[i] <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
}
k <- data.frame(var = variable, p.value = p.value)

It is possible to use sapply() instead of the for loop above. But I don't expect any performance difference, as the loop overhead is tiny compared with the cost of lm() and summary().
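For reference, a sketch of what the apply-style version could look like. It uses vapply(), a stricter sibling of sapply() that checks each result is a single number, and reformulate() to build the formula from a column name. The data frame toy and the helper f_pvalue are made up for illustration; with the real data you would pass data instead of toy.

```r
# Hypothetical stand-in for the real data: a group column plus three variables
set.seed(42)
toy <- data.frame(group = factor(rep(1:4, each = 10)),
                  x1 = rnorm(40),
                  x2 = rnorm(40),
                  x3 = rnorm(40))

variable <- setdiff(colnames(toy), "group")

# Fit <v> ~ group and return the p-value of the model's F-statistic
f_pvalue <- function(v, data) {
  fit <- lm(reformulate("group", response = v), data = data)
  f <- summary(fit)$fstatistic
  unname(pf(f[1], f[2], f[3], lower.tail = FALSE))
}

p.value <- vapply(variable, f_pvalue, numeric(1), data = toy)
k <- data.frame(var = variable, p.value = p.value, row.names = NULL)
k
```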

Upvotes: 1

Nick DiQuattro

Reputation: 739

I think this can get you started at least. It uses the dplyr and broom packages. The basic idea is to define all the formulas you want as character strings, then use lapply() to run them through lm().

library(dplyr)
library(broom)

# Generate a vector of wanted formulas
forms <- c("mpg ~ cyl", "mpg ~ wt")

# Function to apply formula
lmit <- function(form){
  tidy(lm(as.formula(form), mtcars)) %>% 
    mutate(formula = form)
}

# Apply it and bind into a dataframe
results <- bind_rows(lapply(forms, lmit))
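Note that tidy() returns one row per coefficient, so its p.value column holds the per-term t-test p-values. If the p-value of the overall F-statistic is what you are after, broom's glance() returns a one-row model summary that includes it in a p.value column. A minimal sketch along the same lines, where glance_it and model_p are illustrative names:

```r
library(broom)

forms <- c("mpg ~ cyl", "mpg ~ wt")

# Return the formula together with the model-level (F-test) p-value
glance_it <- function(form) {
  out <- glance(lm(as.formula(form), data = mtcars))
  data.frame(formula = form, p.value = unname(out$p.value), row.names = NULL)
}

model_p <- do.call(rbind, lapply(forms, glance_it))
model_p
```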

Upvotes: 0
