Reputation: 53
I'm using cv.glmnet on a binary dataset of genotypes to predict a continuous phenotype variable. The data look something like this, but with >200 genes:
Pheno K00074 K00100 K00179 K00180
1 18.063630 0 0 0 0
2 16.746644 0 0 0 0
3 16.016194 1 0 0 0
4 -1.469207 1 1 0 0
5 -3.047956 1 0 1 1
6 15.274531 1 0 0 0
My code for the cv.glmnet and predict calls looks like this:
cv.lasso <- cv.glmnet(x = as.matrix(zx), y = unlist(zy), alpha = 1,
type.measure = 'mse',keep = TRUE) # runs the model
prediction<-predict(cv.lasso,s = cv.lasso$lambda.1se,
newx = as.matrix(batch1218.kegg[,-1]),type = 'class')
where zx is just binary columns of gene presence/absence, zy is the phenotype column, and batch1218.kegg is a new set of genotypic data that I want to use to predict the phenotype. My prediction ends up looking like this, though:
1
1 6.438563
2 6.438563
3 6.438563
4 6.438563
5 6.438563
6 6.438563
All the numbers are the same for every row, and the same thing happens with other phenotypes. I suspect the problem might be that I only have ~38 rows of phenotypic data compared to a large number of predictor variables, but I wanted to check whether there's some other problem as well.
Upvotes: 3
Views: 1084
Reputation: 47008
Here is a reproduction of your error using an example dataset:
library(glmnet)
data = data.frame(Pheno=rnorm(200),K00074=rbinom(200,1,0.5),
K00100=rbinom(200,1,0.5),K00179=rbinom(200,1,0.5),K00180=rbinom(200,1,0.5))
zx = data[1:100,-1]
zy = data$Pheno[1:100]
batch1218.kegg = data[101:200,]
cv.lasso <- cv.glmnet(x = as.matrix(zx), y = unlist(zy), alpha = 1,
type.measure = 'mse',keep = TRUE) # runs the model
prediction<-predict(cv.lasso,s = cv.lasso$lambda.1se,
newx = as.matrix(batch1218.kegg[,-1]),type = 'class')
head(prediction)
1
101 0.07435786
102 0.07435786
103 0.07435786
104 0.07435786
105 0.07435786
106 0.07435786
Your dependent variable is continuous, i.e. this is regression, so the type should not be 'class'. But in any case, if the best fit is reached by shrinking all of your coefficients to zero, only the intercept remains non-zero, and hence every prediction is the same value:
coef(cv.lasso,s=cv.lasso$lambda.1se)
5 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 0.07435786
K00074 .
K00100 .
K00179 .
K00180 .
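One way to see this for yourself (a sketch on simulated stand-in data, since the original zx/zy aren't available): compare the coefficients at lambda.1se versus lambda.min; the latter usually retains more non-zero terms. Also note that for a gaussian fit, predict() takes type = "response" (or no type argument at all), not "class":

```r
library(glmnet)

# Stand-in data: 100 observations, 4 binary predictors, continuous response
set.seed(1)
x <- matrix(rbinom(100 * 4, 1, 0.5), ncol = 4)
y <- rnorm(100)
fit <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")

coef(fit, s = "lambda.1se")  # may be intercept-only
coef(fit, s = "lambda.min")  # often keeps more non-zero coefficients

# For a continuous (gaussian) response, use type = "response" or omit type;
# type = "class" is only meaningful for binomial/multinomial families.
pred <- predict(fit, newx = x, s = "lambda.min", type = "response")
```

If the lambda.min coefficients are also all zero, the predictors simply carry no signal for this response, and no choice of lambda will produce varying predictions.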
Looking at your dataframe, if you only have 4 independent variables / predictors, lasso is overkill. You can just fit a simple linear regression:
head(predict(glm(Pheno ~ .,data=data[1:100,])))
1 2 3 4 5 6
0.21560938 0.28477818 0.28477818 -0.05017303 -0.11487138 -0.18404019
Upvotes: 0
Reputation: 1970
This usually happens when the chosen lambda is too large: at lambda.1se every coefficient may be shrunk to zero, leaving only the intercept. Try s = "lambda.min" instead.
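A minimal sketch of that fix, reusing the question's variable name cv.lasso on simulated stand-in data (the real zx/zy aren't available):

```r
library(glmnet)

# Stand-in data: 50 observations, 4 binary predictors, continuous response
set.seed(42)
x <- matrix(rbinom(200, 1, 0.5), ncol = 4)
y <- rnorm(50)
cv.lasso <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")

# lambda.min minimizes the cross-validated MSE; lambda.1se is the largest
# lambda within one standard error of that minimum, so it penalizes more
# aggressively and can shrink every coefficient to zero.
prediction <- predict(cv.lasso, newx = x, s = "lambda.min")
```

With only the penalty relaxed to lambda.min, predictions can still be constant if no predictor survives even the weaker penalty.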
Upvotes: 1