KLee

Reputation: 205

Regularized regression using glmnet: No difference between groups?

I am using regularized regression to select the proteins that best discriminate health condition (binary: either disease or no disease). The purpose is to reduce dimension (variable selection) so that we end up with a smaller set of proteins that best discriminates between the two groups.

Tuning parameters were chosen (appropriately, I believe) using the cv.glmnet function in R. Specifically, alpha was chosen based on the prediction rate, and I used lambda.1se. Because the alpha chosen was 0.5, I am actually using elastic net regression.
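(To make the procedure concrete, here is a minimal sketch of that kind of alpha/lambda tuning; x and y are placeholders for the protein matrix and the binary outcome, and the grid and fold count are arbitrary choices:)

library(glmnet)
set.seed(1)
# fix the CV folds so different alpha values are compared on the same splits
foldid = sample(rep(1:10, length.out = nrow(x)))
alphas = seq(0, 1, by = 0.1)
cv_err = sapply(alphas, function(a) {
  cv = cv.glmnet(x, y, family = "binomial", type.measure = "class",
                 alpha = a, foldid = foldid)
  cv$cvm[cv$lambda == cv$lambda.1se]  # CV misclassification at lambda.1se
})
best_alpha = alphas[which.min(cv_err)]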

As a result, 16 proteins (out of almost 500) had non-zero coefficients, and I assume these are the proteins that best "discriminate" the two conditions: disease or no disease. For visualization, I made box plots of these selected proteins.

However, I noticed that one protein (TRAP1; bottom in the figure) does not show any noticeable difference in mean or dispersion between the two groups.

[Figure: box plots of the selected proteins by disease status; TRAP1 at the bottom shows no visible group difference]

So I started to wonder: why did the regularized regression select it as one of the proteins that best discriminate the two health conditions?

Can anyone help me, please?

Thank you very much!!

Upvotes: 0

Views: 522

Answers (1)

StupidWolf

Reputation: 46908

The lasso doesn't guarantee that a selected predictor will show statistical significance in the kind of univariate test you are running. Most of the time, though, if there's predictive power, you will see a correlation between the lasso-selected variables and your dependent variable.

When you set alpha=0.5, glmnet imposes both an L1 penalty (which performs selection by shrinking coefficients exactly to zero) and an L2 penalty (which shrinks the magnitude of the coefficients). Looking at the distribution of your variable, most of it is zero, so keeping it with a very small coefficient costs almost nothing under the penalty, which lets it "escape" being dropped by the L1 selection.
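For reference, for each lambda, glmnet minimizes the loss plus the combined penalty

lambda * ( (1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1 )

so alpha = 1 is the pure lasso, alpha = 0 is ridge, and alpha = 0.5 weights the two penalties equally.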

If the purpose of your lasso is to select variables, I would suggest two things: (1) remove the features that are mostly zeros or have low variance, like the one you showed above (see the sketch after this paragraph), and (2) run the full lasso with alpha = 1. Then compare the accuracy of your predictions.
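A minimal sketch of point 1, assuming prot_mat is your samples-by-proteins matrix (the name and both cutoffs are placeholders to tune on your own data):

# drop proteins that are mostly zeros or nearly constant
mostly_zero = colMeans(prot_mat == 0) > 0.9
low_var     = apply(prot_mat, 2, var) < 1e-3
prot_filtered = prot_mat[, !(mostly_zero | low_var)]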

I don't have your data, but I can illustrate with the iris dataset, where I introduce some nonsense predictors and first calculate the p-values of the predictors:

library(glmnet)

set.seed(111)
dat = iris
# recode the outcome as binary: versicolor vs. the rest
dat$Species = ifelse(dat$Species == "versicolor", 1, 0)
# add 5 pure-noise predictors (750 uniform draws across 150 rows)
noise_var = data.frame(matrix(runif(750), nrow = 150))
colnames(noise_var) = paste0("noise", 1:ncol(noise_var))
dat = cbind(noise_var, dat)

# univariate p-value of each predictor against the outcome
p = sapply(dat[, -ncol(dat)], function(i) cor.test(i, dat$Species)$p.value)

Fit using full lasso:

set.seed(222)
fit_lasso = cv.glmnet(x = as.matrix(dat[, -ncol(dat)]), y = dat[, ncol(dat)], alpha = 1)
# compare coefficients at lambda.1se with the univariate p-values
cbind(coef(fit_lasso, s = "lambda.1se")[-1], p)

                                   p
noise1        0.0000000 5.089346e-01
noise2        0.0000000 2.722532e-01
noise3        0.0000000 9.564023e-02
noise4        0.0000000 7.743330e-01
noise5        0.0000000 7.324517e-02
Sepal.Length  0.0000000 3.341524e-01
Sepal.Width  -0.3073508 1.595624e-09
Petal.Length  0.0000000 1.329302e-02
Petal.Width   0.0000000 1.507473e-01

You can see that the non-zero coefficient is significant, and that some other significant predictors are not included; that's due to correlation among them.

Now fit the elastic net; you can see a noise variable is included with a low coefficient:

set.seed(222)
fit_enet = cv.glmnet(x = as.matrix(dat[, -ncol(dat)]), y = dat[, ncol(dat)], alpha = 0.5)
cbind(coef(fit_enet, s = "lambda.1se")[-1], p)

                                    p
noise1        0.00000000 5.089346e-01
noise2        0.00000000 2.722532e-01
noise3        0.00000000 9.564023e-02
noise4        0.00000000 7.743330e-01
noise5       -0.04636756 7.324517e-02
Sepal.Length  0.00000000 3.341524e-01
Sepal.Width  -0.31452496 1.595624e-09
Petal.Length  0.00000000 1.329302e-02
Petal.Width   0.00000000 1.507473e-01

Also bear in mind that the solution (or selection) is unstable (see this post): if you run it under a different seed, you get a different result:

set.seed(333)
fit_enet = cv.glmnet(x = as.matrix(dat[, -ncol(dat)]), y = dat[, ncol(dat)], alpha = 0.5)
cbind(coef(fit_enet, s = "lambda.1se")[-1], p)

                                   p
noise1        0.0000000 5.089346e-01
noise2        0.0000000 2.722532e-01
noise3        0.0000000 9.564023e-02
noise4        0.0000000 7.743330e-01
noise5        0.0000000 7.324517e-02
Sepal.Length  0.0000000 3.341524e-01
Sepal.Width  -0.2393709 1.595624e-09
Petal.Length  0.0000000 1.329302e-02
Petal.Width   0.0000000 1.507473e-01

You can of course run this over a few seeds to see how stable each coefficient's selection is (see the sketch below), but bear in mind that when there's a lot of correlation, this gets complicated.
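A minimal sketch of that check, reusing the dat object from above (20 seeds and a selection-frequency summary are just one way to do it):

sel = sapply(1:20, function(s) {
  set.seed(s)
  fit = cv.glmnet(x = as.matrix(dat[, -ncol(dat)]), y = dat[, ncol(dat)], alpha = 0.5)
  as.numeric(coef(fit, s = "lambda.1se")[-1] != 0)  # 1 if selected, 0 if not
})
rownames(sel) = colnames(dat)[-ncol(dat)]
rowMeans(sel)  # fraction of seeds in which each predictor was selected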

Upvotes: 2
