pts2
pts2

Reputation: 11

Extracting the t-value for only one predictor variable in a large series of lm's using R

Apologies if this has been answered - I did not find quite what I was looking for. I have 2 predictor variables that I want to use to do a huge number of separate multiple regressions (i.e., on each column of a >3,000,000 column matrix). I know how to run the lm command to do this, such that I get a separate model for each column using the predictors. However, I need to extract the t-value (and separately, p-value) for only ONE of the predictor variables (the other one is a control variable).

For example:

    nXvariables <- 2
    nYvariables <- 4
    nObs <- 15
    x <- matrix(rnorm(nXvariables*nObs),nrow=nObs)
    y <- matrix(rnorm(nYvariables*nObs),nrow=nObs)
    allModels <-lm(y~x)
    summary(allModels)

    Response Y1 :

     Call:
     lm(formula = Y1 ~ x)

     Residuals:
         Min       1Q   Median       3Q      Max 
    -1.51271 -0.72176 -0.07585  0.42939  2.02093 

     Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   0.2715     0.3081   0.881    0.396
     x1            0.1937     0.2969   0.652    0.526
     x2            0.5227     0.3336   1.567    0.143

     Residual standard error: 1.049 on 12 degrees of freedom
     Multiple R-squared:  0.1968,   Adjusted R-squared:  0.06292 
     F-statistic:  1.47 on 2 and 12 DF,  p-value: 0.2685


     Response Y2 :

     Call:
     lm(formula = Y2 ~ x)

     Residuals:
        Min      1Q  Median      3Q     Max 
    -1.4219 -0.4949 -0.1893  0.6164  1.3186 

     Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -0.1474     0.2624  -0.562    0.585
     x1           -0.4246     0.2528  -1.679    0.119
     x2           -0.2980     0.2841  -1.049    0.315

     Residual standard error: 0.8934 on 12 degrees of freedom
     Multiple R-squared:  0.2509,   Adjusted R-squared:  0.1261 
     F-statistic:  2.01 on 2 and 12 DF,  p-value: 0.1766


     Response Y3 :

     Call:
     lm(formula = Y3 ~ x)

     Residuals:
         Min       1Q   Median       3Q      Max 
    -1.22364 -0.53426 -0.01539  0.14891  1.80698 

     Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -0.2900     0.2600  -1.116    0.286
     x1            0.2976     0.2505   1.188    0.258
     x2            0.1262     0.2814   0.448    0.662

     Residual standard error: 0.8851 on 12 degrees of freedom
     Multiple R-squared:  0.1204,   Adjusted R-squared:  -0.02617 
     F-statistic: 0.8215 on 2 and 12 DF,  p-value: 0.4631


     Response Y4 :

     Call:
     lm(formula = Y4 ~ x)

     Residuals:
        Min      1Q  Median      3Q     Max 
    -2.0015 -0.2455  0.1279  0.5376  1.1856 

     Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -0.09163    0.27600  -0.332    0.746
     x1          -0.28386    0.26592  -1.067    0.307
     x2          -0.05683    0.29878  -0.190    0.852

     Residual standard error: 0.9396 on 12 degrees of freedom
     Multiple R-squared:  0.09006,  Adjusted R-squared:  -0.06159 
     F-statistic: 0.5939 on 2 and 12 DF,  p-value: 0.5676

This lists 4 separate models, where the 2 predictor (X) variables are separately used to predict each of the 4 (Y) dependent ("response") variables. However, I need a vector that lists the t-values for the x1 variable only (i.e., the value in the "t value" column of each of the "Coefficients" table that corresponds to the x1 variable alone). So the vector I need, for this example, would look like:

    0.652, -1.679, 1.188, -1.067

I also need a separate vector that has the corresponding probability values for each of these (i.e., the corresponding x1 values in each of the "Coefficients" tables in the "Pr(>|t|)" column).

I tried saving summary(allModels) into an R object, and then trying to extract subsets of that data, but one of the levels is "Response Y1" etc., and the space causes it to bail (I think).

Upvotes: 1

Views: 204

Answers (3)

pts2
pts2

Reputation: 11

The suggestions are very helpful, but unfortunately the biggest limitation for my system was RAM. The solution, given this, was to do a loop, doing the lm() for each column of the matrix one at a time and extracting the coefficients each time. Below is the code that worked for me (it took my system ~2.5 hours to complete >3,000,000 columns). Note that it turns out to be MUCH faster to make empty vectors of the right length beforehand, and put the values in the correct space one at a time, rather than using append() (Something to do with the way R deals with vectors...). I used Marco Sandri's ideas about how to extract the coefficients I wanted (thank you!). I also cobbled together a simple progress indicator, to help me know that it was running consistently.

Assuming: Data is in matrix.with.my.data, and that variable1, variable2, and variable3 are vectors of the correct length and whose data order matches that of the matrix rows

    # calculate total number of columns(voxels)
    totalColumns<-ncol(matrix.with.my.data)

    # get the current time so we can estimate remaining time for our counter
    startTime<-Sys.time()

    # initialize the variables we will be putting the results for each column into:
    variable1.slope <- vector(,totalColumns)
    variable1.tvalue <- vector(,totalColumns)
    variable1.pvalue <- vector(,totalColumns)

    variable2.slope <- vector(,totalColumns)
    variable2.tvalue <- vector(,totalColumns)
    variable2.pvalue <- vector(,totalColumns)

    variable3.slope <- vector(,totalColumns)
    variable3.tvalue <- vector(,totalColumns)
    variable3.pvalue <- vector(,totalColumns)

    # here is the loop that does the work, one column at a time:
    for (column in 1:totalColumns)
    {

    # progress indicator: print out % of total columns completed (and the time stamp) every 5000 columns
    # Modulus operation
    if(column %% 5000==0) 
    {
    # Print number completed on screen, and estimate time remaining
    cat(paste0(round(column/totalColumns*100, digits=1), "% completed; ", "Estimated time remaining: ", round((((as.numeric(Sys.time())-as.numeric(startTime))/voxel)*(totalColumns-voxel))/60), digits=0), " minutes\n"))
    }

    model.summary<-summary(lm(matrix.with.my.data[, column] ~ variable1 + variable2 + variable3))

    # pull out the variable1 coefficients, and append them to the appropriate vector:
    variable1.slope[column]<- coefficients(model.summary)["variable1","Estimate"]
    variable1.tvalue[column]<- coefficients(model.summary)["variable1","t value"]
    variable1.pvalue[column]<- coefficients(model.summary)["variable1","Pr(>|t|)"]

    # pull out the variable2 coefficients, and append them to the appropriate vector:
    variable2.slope[column]<- coefficients(model.summary)["variable2","Estimate"]
    variable2.tvalue[column]<- coefficients(model.summary)["variable2","t value"]
    variable2.pvalue[column]<- coefficients(model.summary)["variable2","Pr(>|t|)"]

    # pull out the variable3 coefficients, and append them to the appropriate vector:
    variable3.slope[column]<- coefficients(model.summary)["variable3","Estimate"]
    variable3.tvalue[column]<- coefficients(model.summary)["variable3","t value"]
    variable3.pvalue[column]<- coefficients(model.summary)["variable3","Pr(>|t|)"]
    }

Upvotes: 0

Benjamin
Benjamin

Reputation: 17369

broom makes this pretty easy.

library(broom)

allModelsFrame <- tidy(allModels) 
allModelsFrame[allModelsFrame$term == "x1", ]

Edit

Since the size of your project is creating an issue with computation time, here's another approach that obtains the coefficients and calculates the statistics independently.

coef <- as.vector(coef(allModels))
se <- sqrt(diag(vcov(allModels)))
t <- coef / se
p <- 2 * (1 - pt(abs(t), df = allModels$df.residual, lower.tail = TRUE))

I find it interesting, however, that when I compare this new code with Marco's answer, Marco's is a bit faster. I'd recommend using his solution.

If his solution is still unbearably slow, I think you will need to look into building one model at a time, and parallelizing the process.

library(parallel)
cl <- makeCluster(7)
clusterExport(cl, c("x", "y"))

parLapply(cl = cl,
              X = seq_len(ncol(y)),
              fun = function(i){
                coef(summary(lm(y[, i] ~ x)))
              })

Upvotes: 2

Marco Sandri
Marco Sandri

Reputation: 24252

Another solution:

nXvariables <- 2; nYvariables <- 4;  nObs <- 15
set.seed(1)
x <- matrix(rnorm(nXvariables*nObs),nrow=nObs)
y <- matrix(rnorm(nYvariables*nObs),nrow=nObs)
allModels <-lm(y~x)

out <- summary(allModels)
(cf <- coef(out))

# Response Y1 :
#               Estimate Std. Error    t value   Pr(>|t|)
# (Intercept)  0.1420606  0.1562425  0.9092312 0.38112719
# x1          -0.2494874  0.1586252 -1.5728108 0.14174445
# x2          -0.4153935  0.1886099 -2.2023954 0.04793035
# 
# Response Y2 :
#                Estimate Std. Error    t value   Pr(>|t|)
# (Intercept)  0.14474061  0.2171560  0.6665283 0.51768356
# x1          -0.02212358  0.2204676 -0.1003485 0.92172473
# x2           0.51118985  0.2621422  1.9500477 0.07492154
# 
# Response Y3 :
#                Estimate Std. Error    t value  Pr(>|t|)
# (Intercept)  0.20094144  0.3323848  0.6045446 0.5567399
# x1          -0.15610930  0.3374536 -0.4626096 0.6519203
# x2          -0.08045549  0.4012420 -0.2005161 0.8444352
# 
# Response Y4 :
#                Estimate Std. Error    t value  Pr(>|t|)
# (Intercept) 0.023140470  0.1813438 0.12760553 0.9005746
# x1          0.003848764  0.1841092 0.02090478 0.9836652
# x2          0.265064518  0.2189112 1.21083145 0.2492681

Here are some ways for extracting data from the cf matrix:*

# Matrix of coefficients, std. errors, t values and p-values for Y1 response variable
cf[1]
# Response Y1 :
#               Estimate Std. Error    t value   Pr(>|t|)
# (Intercept)  0.1420606  0.1562425  0.9092312 0.38112719
# x1          -0.2494874  0.1586252 -1.5728108 0.14174445
# x2          -0.4153935  0.1886099 -2.2023954 0.04793035

# Vector of coefficients for Y1 response variable
cf[1]$'Response Y1'[,1]

# Vector of p-values for Y1 response variable
cf[1]$'Response Y1'[,4]

# Vector of x1 coefficients for all the response variables
sapply(cf,"[",2,1)
#
#  Response Y1  Response Y2  Response Y3  Response Y4 
# -0.249487417 -0.022123582 -0.156109301  0.003848764

Upvotes: 2

Related Questions