lm(): loop through multiple linear models exporting p-value of F-statistic

Question

I have a large data set for which I need to run linear models comparing groups, and I need the p-values for those group comparisons. There are four groups (so I need 1~2, 1~3, 1~4, 2~3, 2~4, 3~4), and there are 130 columns for which the data from these groups need to be compared. Any help would be greatly appreciated!

I have this, which gives me exactly what I need.

fit<-lm(variable~group, data=data)
summary(fit)

However, with all of the groups and columns, I have nearly 800 comparisons to make, so I want to avoid doing this manually. I tried writing a for loop, but it isn't working.

k<-data.frame()
for (i in 1:130){
 [i,1]<-colnames(data)
 fit<- lm(i~group, data=data)
 [i,2] <- fit$p.value
}

But this has given me a variety of different errors. I really just need the p-values. Help would be greatly greatly appreciated!! Thank you!

Zheyuan Li · Accepted Answer

(2016-06-18) Your question is not completely answerable at this stage. In the following, I shall point out several problems.

How to get p-value properly

I assume you want the p-value of the F-statistic for the model, as an indication of overall model significance. Suppose your fitted model is fit; then you can compute it this way:

fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))

As a demonstration, I will use the built-in dataset trees.

fit <- lm(Height ~ Girth, trees)
## truncated output of summary(fit)
# > summary(fit)
# Residual standard error: 5.538 on 29 degrees of freedom
# Multiple R-squared:  0.2697,  Adjusted R-squared:  0.2445 
# F-statistic: 10.71 on 1 and 29 DF,  p-value: 0.002758

fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
## > p_value
# [1] 0.002757815

So, p_value agrees with the printed summary.
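
The same calculation can be wrapped in a small helper so it is easy to reuse inside a loop. The sketch below is an illustration and not part of the original answer: the name f_pvalue is hypothetical, and it uses pf(..., lower.tail = FALSE), which computes the upper tail directly and should hold up better than 1 - pf(...) when the p-value is very small.

```r
# Hypothetical helper (not from the original answer):
# overall F-test p-value of a fitted lm object.
f_pvalue <- function(fit) {
  f <- summary(fit)$fstatistic      # named vector: value, numdf, dendf
  if (is.null(f)) return(NA_real_)  # intercept-only models have no F-statistic
  unname(pf(f[1], f[2], f[3], lower.tail = FALSE))
}

fit <- lm(Height ~ Girth, data = trees)
f_pvalue(fit)  # should match the p-value printed by summary(fit)
```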

Your loop

I suggest you store the results in vectors rather than in a data frame while the loop is running.

variable <- character(130)
p.value <- numeric(130)

You can combine the results at the end to a data frame via:

k <- data.frame(var = variable, p.value = p.value)

Why? Because this is memory efficient: assigning into preallocated vectors can avoid growing or copying a data frame on every iteration. After these corrections, we arrive at:

variable <- character(130)
p.value <- numeric(130)
for (i in 1:130) {
  variable[i] <- colnames(data)
  fit <- lm(i~group, data=data)
  fstatistic <- summary(fit)$fstatistic
  p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
  p.value[i] <- p_value
}
k <- data.frame(var = variable, p.value = p.value)

Further problems

I still don't think the code above will work, because I am not sure the following two lines do what you intend:

  variable[i] <- colnames(data)
  fit <- lm(i~group, data=data)

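colnames(data) does not depend on i, so it returns the full vector of column names on every iteration. Assigning that whole vector to variable[i] stores only the first name and raises a warning about the replacement length.
i~group looks odd: the formula refers to a variable literally named i, not to the i-th column. Do you have i in your data?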
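
Both points can be checked on a built-in data set. The snippet below is only a sketch that uses trees as a stand-in for the asker's data (an assumption for illustration). It shows that colnames() returns every name at once, so a single name has to be picked with [i], and that a formula keeps i as a symbol rather than substituting its value.

```r
# trees stands in for the real data here (an assumption for illustration).
colnames(trees)        # all column names at once: "Girth" "Height" "Volume"
i <- 2
colnames(trees)[i]     # one name: "Height"

# A formula stores symbols, not values, so i is not replaced by 2 here.
all.vars(i ~ group)    # "i" "group"
```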

I can't solve these issues for you without knowing what your data look like. If you could post a small subset of your data, I could be more specific.

Follow-up (2016-06-19)

Thank you. This has been extremely helpful. I don't have "i" in my data, but I was hoping that I could use that to represent the different column names, so that it goes through all of them. Is there a way to assign column names numbers so that this would work?

Yes, but I need to know what you have for each column.

Column 1 has a group number. The following columns have data for different factors I am looking at.

OK, so I think ncol(data) = 131, where the first column is group, and the remaining 130 columns are what you will test. Then this should work:

variable <- colnames(data)[-1]
p.value <- numeric(130)
for (i in 1:130) {
  fit <- lm(paste(variable[i], "group", sep = "~"), data=data)
  fstatistic <- summary(fit)$fstatistic
  p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
  p.value[i] <- p_value
}
k <- data.frame(var = variable, p.value = p.value)
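
As a runnable check of this pattern, here is a sketch on the built-in iris data. iris is an assumption standing in for the real data: Species plays the role of group, and the four numeric columns are the variables being tested. The sketch uses reformulate() to build each formula and seq_along() so the loop length follows the number of columns; both are substitutions for the paste() call and the hard-coded 130 above, not the original method.

```r
# iris stands in for the asker's data (illustrative assumption):
# Species is the grouping column, every other column is a response to test.
dat <- iris
variable <- setdiff(colnames(dat), "Species")
p.value <- numeric(length(variable))

for (i in seq_along(variable)) {
  # Builds e.g. Sepal.Length ~ Species
  fit <- lm(reformulate("Species", response = variable[i]), data = dat)
  fstatistic <- summary(fit)$fstatistic
  p.value[i] <- unname(pf(fstatistic[1], fstatistic[2], fstatistic[3],
                          lower.tail = FALSE))
}

k <- data.frame(var = variable, p.value = p.value)
k
```

One thing to watch with the real data: if group is stored as numbers, lm() treats it as a numeric predictor and fits a single slope. Converting it first, for example with data$group <- factor(data$group), makes lm() compare the groups as categories.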

It is possible to use sapply() instead of the for loop above, but I don't expect any performance difference, since the loop overhead is tiny compared with the cost of lm() and summary().
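
For reference, a sapply() version might look like the sketch below. It uses the built-in iris data as a stand-in for the real data (an assumption), with Species in place of group, and keeps the paste()-built formula from the loop above. The upper tail is taken with lower.tail = FALSE instead of 1 - pf(...), which is a small substitution.

```r
# iris stands in for the real data (illustrative assumption).
dat <- iris
variable <- setdiff(colnames(dat), "Species")

# One F-test p-value per response column; sapply returns a named numeric vector.
p.value <- sapply(variable, function(v) {
  fit <- lm(paste(v, "Species", sep = "~"), data = dat)
  f <- summary(fit)$fstatistic
  unname(pf(f[1], f[2], f[3], lower.tail = FALSE))
})

k <- data.frame(var = variable, p.value = p.value, row.names = NULL)
```

The result has the same shape as the data frame built by the for loop, so the choice between the two is mostly a matter of taste.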
