rholeepoly

Reputation: 53

predict cv.glmnet giving me identical values for every row

I'm using cv.glmnet on a binary dataset of genotypes to predict a continuous variable phenotype. Data looks something like this but with >200 genes:

       Pheno K00074 K00100 K00179 K00180
1  18.063630      0      0      0      0
2  16.746644      0      0      0      0
3  16.016194      1      0      0      0
4  -1.469207      1      1      0      0
5  -3.047956      1      0      1      1
6  15.274531      1      0      0      0 

My code for the cv.glmnet and predict looks like this:

cv.lasso <- cv.glmnet(x = as.matrix(zx), y = unlist(zy), alpha = 1,
                      type.measure = 'mse',keep = TRUE) # runs the model
    
prediction<-predict(cv.lasso,s = cv.lasso$lambda.1se,
                    newx = as.matrix(batch1218.kegg[,-1]),type = 'class')

where zx is just binary columns of gene presence/absence, and zy is the phenotype column. batch1218.kegg is a new set of genotypic data that I want to use to predict the phenotype. My prediction ends up looking like this though:

         1
1 6.438563
2 6.438563
3 6.438563
4 6.438563
5 6.438563
6 6.438563

Where all the numbers are the same for every row. I'm getting the same thing happen with other phenotypes as well. I'm thinking the problem might be that I'm only working with ~38 rows of phenotypic data in comparison to a large number of predictor variables. But wanted to see if there's maybe another problem I'm dealing with.

Upvotes: 3

Views: 1084

Answers (2)

StupidWolf

Reputation: 47008

Here is reproducing your error using an example dataset:

library(glmnet)

data = data.frame(Pheno=rnorm(200),K00074=rbinom(200,1,0.5),
K00100=rbinom(200,1,0.5),K00179=rbinom(200,1,0.5),K00180=rbinom(200,1,0.5))

zx = data[1:100,-1]
zy = data$Pheno[1:100]

batch1218.kegg = data[101:200,]

cv.lasso <- cv.glmnet(x = as.matrix(zx), y = unlist(zy), alpha = 1,
                      type.measure = 'mse',keep = TRUE) # runs the model

prediction<-predict(cv.lasso,s = cv.lasso$lambda.1se,
                    newx = as.matrix(batch1218.kegg[,-1]),type = 'class')

head(prediction)
             1
101 0.07435786
102 0.07435786
103 0.07435786
104 0.07435786
105 0.07435786
106 0.07435786

Your dependent variable is continuous, i.e. this is regression, so `type` should not be `'class'` (that option only applies to classification models). But in any case, if the best cross-validated fit comes from shrinking all of your variables to zero, only the intercept is non-zero, and hence every row gets the same predicted value:

coef(cv.lasso,s=cv.lasso$lambda.1se)
5 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) 0.07435786
K00074      .         
K00100      .         
K00179      .         
K00180      . 

Looking at your data frame, if you only have 4 independent variables / predictors, lasso is overkill. You can just fit a simple linear regression:

head(predict(glm(Pheno ~ .,data=data[1:100,])))
          1           2           3           4           5           6 
 0.21560938  0.28477818  0.28477818 -0.05017303 -0.11487138 -0.18404019 

Upvotes: 0

Sebastian

Reputation: 1970

This usually happens when the lambda you choose is too large, so every coefficient is shrunk to zero. Try "lambda.min" instead.
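For illustration, here is a minimal sketch on simulated data shaped like the question's (the column names and effect sizes are made up), comparing the two lambda choices. `lambda.1se` is the largest (most heavily penalized) lambda within one standard error of the minimum cross-validated error, so it yields a sparser model than `lambda.min`:

```r
library(glmnet)

set.seed(1)
# Simulated binary predictors in the same shape as the question's data
x <- matrix(rbinom(100 * 4, 1, 0.5), ncol = 4,
            dimnames = list(NULL, c("K00074", "K00100", "K00179", "K00180")))
y <- 2 * x[, "K00074"] + rnorm(100)  # phenotype driven by one gene

cv.lasso <- cv.glmnet(x = x, y = y, alpha = 1, type.measure = "mse")

# lambda.min: fit with the lowest cross-validated error (less shrinkage)
pred.min <- predict(cv.lasso, s = cv.lasso$lambda.min, newx = x)

# lambda.1se: sparser fit, which may reduce to intercept-only
pred.1se <- predict(cv.lasso, s = cv.lasso$lambda.1se, newx = x)
```

If `pred.1se` is constant but `pred.min` is not, the 1-SE rule dropped every predictor and only the intercept survived.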

Upvotes: 1
