Julia Bruner
Julia Bruner

Reputation: 1

Loop to apply lm() over rows of a matrix and extract significant models

I have a dataframe with 8,000 genes expression values in rows and 40 columns. The 40 columns represent observations from 4 different time points. The first ten columns are from no treatment, the second ten are after one week of treatment, the third column is after 2 weeks of treatment, and the fourth column is after 7 weeks of treatment.

This is the idea (smaller ex edit):

m1 <- matrix(data=1:32, nrow = 8, ncol = 4)
time <- c(1:4)

summaries <- data.frame(nrow=8, ncol=4)
pvalues <- function(x) {
for (i in (1:8)) {
raw <- unlist(as.vector(x[i,]))
lm <- lm(raw ~ time)
summaries[i, ] <- summary(lm)$coefficients[,4]
return(summaries)
}
}
pvalues(m1)

I know it is still overwriting but I don't know how to fit ix.

Upvotes: 0

Views: 166

Answers (1)

Gregor Thomas
Gregor Thomas

Reputation: 145765

Okay I ran your code and got

Warning message:

In summary.lm(lm) : essentially perfect fit: summary may be unreliable

This is because the sample data was sequential. I changed m1 to use rnorm(32) instead of 1:32, and that problem is solved. (Though this problem is specific to the example, not the real case, I would assume.)

Next problem: the size of the summaries object. You used data.frame which takes column names as arguments. We want to use matrix which lets you set the rows and columns:

## Bad - 1 row and 2 columns
summaries <- data.frame(nrow=8, ncol=4)
summaries
#   nrow ncol
# 1    8    4

## Good - 8 rows, 2 columns
summaries <- matrix(nrow = 8, ncol = 2)
# summaries
#      [,1] [,2]
# [1,]   NA   NA
# [2,]   NA   NA
# [3,]   NA   NA
# [4,]   NA   NA
# [5,]   NA   NA
# [6,]   NA   NA
# [7,]   NA   NA
# [8,]   NA   NA

But then, running your code still only the first row is filled! Why? A function stops as soon as it hits a return(), thinking it is done. We need to move return(summaries) to after the for loop, not inside the for loop. Putting it all together:

m1 <- matrix(data = rnorm(32),
             nrow = 8,
             ncol = 4)
time <- c(1:4)

summaries <- matrix(nrow = 8, ncol = 2)
pvalues <- function(x) {
  for (i in (1:8)) {
    raw <- c(x[i, ])
    lm <- lm(raw ~ time)
    summaries[i,] <- summary(lm)$coefficients[, 4]
  } ## End the for loop before the `return()`
  return(summaries)
}
pvalues(m1)
#           [,1]      [,2]
# [1,] 0.6235167 0.5461115
# [2,] 0.4256698 0.3992509
# [3,] 0.3041439 0.2751724
# [4,] 0.8087557 0.8252432
# [5,] 0.8820501 0.1812292
# [6,] 0.4997327 0.5582880
# [7,] 0.5589398 0.8150613
# [8,] 0.6283059 0.8994896

We can gussy it up a little bit by assigning column names based on the last iteration summary:

pvalues <- function(x) {
  for (i in (1:8)) {
    raw <- c(x[i, ])
    lm <- lm(raw ~ time)
    summaries[i,] <- summary(lm)$coefficients[, 4]
  }
  colnames(summaries) = names(summary(lm)$coefficients[, 4])
  return(summaries)
}
pvalues(m1)
#      (Intercept)      time
# [1,]   0.6235167 0.5461115
# [2,]   0.4256698 0.3992509
# [3,]   0.3041439 0.2751724
# [4,]   0.8087557 0.8252432
# [5,]   0.8820501 0.1812292
# [6,]   0.4997327 0.5582880
# [7,]   0.5589398 0.8150613
# [8,]   0.6283059 0.8994896

Upvotes: 1

Related Questions