Reputation: 11
Apologies if this has been answered - I did not find quite what I was looking for. I have 2 predictor variables that I want to use to do a huge number of separate multiple regressions (i.e., on each column of a >3,000,000 column matrix). I know how to run the lm command to do this, such that I get a separate model for each column using the predictors. However, I need to extract the t-value (and separately, p-value) for only ONE of the predictor variables (the other one is a control variable).
For example:
nXvariables <- 2
nYvariables <- 4
nObs <- 15
x <- matrix(rnorm(nXvariables*nObs),nrow=nObs)
y <- matrix(rnorm(nYvariables*nObs),nrow=nObs)
allModels <-lm(y~x)
summary(allModels)
Response Y1 :
Call:
lm(formula = Y1 ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.51271 -0.72176 -0.07585 0.42939 2.02093
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2715 0.3081 0.881 0.396
x1 0.1937 0.2969 0.652 0.526
x2 0.5227 0.3336 1.567 0.143
Residual standard error: 1.049 on 12 degrees of freedom
Multiple R-squared: 0.1968, Adjusted R-squared: 0.06292
F-statistic: 1.47 on 2 and 12 DF, p-value: 0.2685
Response Y2 :
Call:
lm(formula = Y2 ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.4219 -0.4949 -0.1893 0.6164 1.3186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1474 0.2624 -0.562 0.585
x1 -0.4246 0.2528 -1.679 0.119
x2 -0.2980 0.2841 -1.049 0.315
Residual standard error: 0.8934 on 12 degrees of freedom
Multiple R-squared: 0.2509, Adjusted R-squared: 0.1261
F-statistic: 2.01 on 2 and 12 DF, p-value: 0.1766
Response Y3 :
Call:
lm(formula = Y3 ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.22364 -0.53426 -0.01539 0.14891 1.80698
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2900 0.2600 -1.116 0.286
x1 0.2976 0.2505 1.188 0.258
x2 0.1262 0.2814 0.448 0.662
Residual standard error: 0.8851 on 12 degrees of freedom
Multiple R-squared: 0.1204, Adjusted R-squared: -0.02617
F-statistic: 0.8215 on 2 and 12 DF, p-value: 0.4631
Response Y4 :
Call:
lm(formula = Y4 ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.0015 -0.2455 0.1279 0.5376 1.1856
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.09163 0.27600 -0.332 0.746
x1 -0.28386 0.26592 -1.067 0.307
x2 -0.05683 0.29878 -0.190 0.852
Residual standard error: 0.9396 on 12 degrees of freedom
Multiple R-squared: 0.09006, Adjusted R-squared: -0.06159
F-statistic: 0.5939 on 2 and 12 DF, p-value: 0.5676
This lists 4 separate models, where the 2 predictor (X) variables are separately used to predict each of the 4 (Y) dependent ("response") variables. However, I need a vector that lists the t-values for the x1 variable only (i.e., the value in the "t value" column of each of the "Coefficients" table that corresponds to the x1 variable alone). So the vector I need, for this example, would look like:
0.652, -1.679, 1.188, -1.067
I also need a separate vector that has the corresponding probability values for each of these (i.e., the corresponding x1 values in each of the "Coefficients" tables in the "Pr(>|t|)" column).
I tried saving summary(allModels) into an R object, and then trying to extract subsets of that data, but one of the levels is "Response Y1" etc., and the space causes it to bail (I think).
Upvotes: 1
Views: 204
Reputation: 11
The suggestions are very helpful, but unfortunately the biggest limitation for my system was RAM. The solution, given this, was to do a loop, doing the lm() for each column of the matrix one at a time and extracting the coefficients each time. Below is the code that worked for me (it took my system ~2.5 hours to complete >3,000,000 columns). Note that it turns out to be MUCH faster to make empty vectors of the right length beforehand, and put the values in the correct space one at a time, rather than using append() (Something to do with the way R deals with vectors...). I used Marco Sandri's ideas about how to extract the coefficients I wanted (thank you!). I also cobbled together a simple progress indicator, to help me know that it was running consistently.
Assuming: Data is in matrix.with.my.data, and that variable1, variable2, and variable3 are vectors of the correct length and whose data order matches that of the matrix rows
# calculate total number of columns(voxels)
totalColumns<-ncol(matrix.with.my.data)
# get the current time so we can estimate remaining time for our counter
startTime<-Sys.time()
# initialize the variables we will be putting the results for each column into:
variable1.slope <- vector(,totalColumns)
variable1.tvalue <- vector(,totalColumns)
variable1.pvalue <- vector(,totalColumns)
variable2.slope <- vector(,totalColumns)
variable2.tvalue <- vector(,totalColumns)
variable2.pvalue <- vector(,totalColumns)
variable3.slope <- vector(,totalColumns)
variable3.tvalue <- vector(,totalColumns)
variable3.pvalue <- vector(,totalColumns)
# here is the loop that does the work, one column at a time:
for (column in 1:totalColumns)
{
# progress indicator: print out % of total columns completed (and the time stamp) every 5000 columns
# Modulus operation
if(column %% 5000==0)
{
# Print number completed on screen, and estimate time remaining
cat(paste0(round(column/totalColumns*100, digits=1), "% completed; ", "Estimated time remaining: ", round((((as.numeric(Sys.time())-as.numeric(startTime))/voxel)*(totalColumns-voxel))/60), digits=0), " minutes\n"))
}
model.summary<-summary(lm(matrix.with.my.data[, column] ~ variable1 + variable2 + variable3))
# pull out the variable1 coefficients, and append them to the appropriate vector:
variable1.slope[column]<- coefficients(model.summary)["variable1","Estimate"]
variable1.tvalue[column]<- coefficients(model.summary)["variable1","t value"]
variable1.pvalue[column]<- coefficients(model.summary)["variable1","Pr(>|t|)"]
# pull out the variable2 coefficients, and append them to the appropriate vector:
variable2.slope[column]<- coefficients(model.summary)["variable2","Estimate"]
variable2.tvalue[column]<- coefficients(model.summary)["variable2","t value"]
variable2.pvalue[column]<- coefficients(model.summary)["variable2","Pr(>|t|)"]
# pull out the variable3 coefficients, and append them to the appropriate vector:
variable3.slope[column]<- coefficients(model.summary)["variable3","Estimate"]
variable3.tvalue[column]<- coefficients(model.summary)["variable3","t value"]
variable3.pvalue[column]<- coefficients(model.summary)["variable3","Pr(>|t|)"]
}
Upvotes: 0
Reputation: 17369
broom
makes this pretty easy.
library(broom)
allModelsFrame <- tidy(allModels)
allModelsFrame[allModelsFrame$term == "x1", ]
Since the size of your project is creating an issue with computation time, here's another approach that obtains the coefficients and calculates the statistics independently.
coef <- as.vector(coef(allModels))
se <- sqrt(diag(vcov(allModels)))
t <- coef / se
p <- 2 * (1 - pt(abs(t), df = allModels$df.residual, lower.tail = TRUE))
I find it interesting, however, that when I compare this new code with Marco's answer, Marco's is a bit faster. I'd recommend using his solution.
If his solution is still unbearably slow, I think you will need to look into building one model at a time, and parallelizing the process.
library(parallel)
cl <- makeCluster(7)
clusterExport(cl, c("x", "y"))
parLapply(cl = cl,
X = seq_len(ncol(y)),
fun = function(i){
coef(summary(lm(y[, i] ~ x)))
})
Upvotes: 2
Reputation: 24252
Another solution:
nXvariables <- 2; nYvariables <- 4; nObs <- 15
set.seed(1)
x <- matrix(rnorm(nXvariables*nObs),nrow=nObs)
y <- matrix(rnorm(nYvariables*nObs),nrow=nObs)
allModels <-lm(y~x)
out <- summary(allModels)
(cf <- coef(out))
# Response Y1 :
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.1420606 0.1562425 0.9092312 0.38112719
# x1 -0.2494874 0.1586252 -1.5728108 0.14174445
# x2 -0.4153935 0.1886099 -2.2023954 0.04793035
#
# Response Y2 :
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.14474061 0.2171560 0.6665283 0.51768356
# x1 -0.02212358 0.2204676 -0.1003485 0.92172473
# x2 0.51118985 0.2621422 1.9500477 0.07492154
#
# Response Y3 :
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.20094144 0.3323848 0.6045446 0.5567399
# x1 -0.15610930 0.3374536 -0.4626096 0.6519203
# x2 -0.08045549 0.4012420 -0.2005161 0.8444352
#
# Response Y4 :
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.023140470 0.1813438 0.12760553 0.9005746
# x1 0.003848764 0.1841092 0.02090478 0.9836652
# x2 0.265064518 0.2189112 1.21083145 0.2492681
Here are some ways for extracting data from the cf
matrix:*
# Matrix of coefficients, std. errors, t values and p-values for Y1 response variable
cf[1]
# Response Y1 :
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.1420606 0.1562425 0.9092312 0.38112719
# x1 -0.2494874 0.1586252 -1.5728108 0.14174445
# x2 -0.4153935 0.1886099 -2.2023954 0.04793035
# Vector of coefficients for Y1 response variable
cf[1]$'Response Y1'[,1]
# Vector of p-values for Y1 response variable
cf[1]$'Response Y1'[,4]
# Vector of x1 coefficients for all the response variables
sapply(cf,"[",2,1)
#
# Response Y1 Response Y2 Response Y3 Response Y4
# -0.249487417 -0.022123582 -0.156109301 0.003848764
Upvotes: 2