screechOwl
screechOwl

Reputation: 28169

R Is there a way to do thresholding in linear regression?

I'm trying to do a linear regression but I'm only looking to use variables with positive coefficients (I think this is called hard-thresholding, but I'm not certain).

for example:

> summary(lm1)

Call:
lm(formula = value ~ ., data = intCollect1[, -c(1, 3)])

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6518  -0.2089  -0.0227   0.2035  15.2235 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.099763   0.024360   4.095 4.22e-05 ***
modelNum3802    0.208867   0.008260  25.285  < 2e-16 ***
modelNum8000   -0.086258   0.013104  -6.582 4.65e-11 ***
modelNum8001   -0.058225   0.010741  -5.421 5.95e-08 ***
modelNum8002   -0.001813   0.012087  -0.150 0.880776    
modelNum8003   -0.083646   0.011015  -7.594 3.13e-14 ***
modelNum8004    0.002521   0.010729   0.235 0.814254    
modelNum8005    0.301286   0.011314  26.630  < 2e-16 ***

In the above regression, I would only want to use models 3802, 8004 and 8005. Is there a way to do this without copying and pasting each variable name?

Upvotes: 3

Views: 1233

Answers (3)

John Jiang
John Jiang

Reputation: 282

You can also reformulate your linear regression model in the following way: label ~ sum(exp(\alpha_i) f_i)

the optimization target will be sum_j (label_j - sum_i(exp(\alpha_i) f_i))^2

This has no closed form solution but can be solved efficiently since it's convex in the \alpha_i's.

Once you compute the \alpha_i's, you can recast them as the regressors of a usual linear model by exponentiating them.

Upvotes: 0

screechOwl
screechOwl

Reputation: 28169

I found the nnls (non-negative least square) package to be worth looking at.

Upvotes: 1

flodel
flodel

Reputation: 89097

Instead of using lm, you can formulate your problem in terms of a Quadratic Programming:

Minimize the sum of the squared replication errors subject to the constraint that your linear coefficients are all positive.

Such problems can be solved using lsei from the limSolve package. Looking at your example, it would look a lot like this:

x.variables <- c("modelNum3802", "modelNum8000", ...)
num.var <- length(x.variables)

lsei(A = intCollect1[, x.variables],
     B = intCollect1$value,
     G = diag(num.var),
     H = rep(0, num.var))

Upvotes: 3

Related Questions