Reputation: 255
I am trying to fit a Lasso regression model using glmnet(). As I have never worked with Lasso regression before, I followed some tutorials, but fitting the model always results in the following error:
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,:
one multinomial or binomial class has 1 or 0 observations; not allowed
Working with the dataset from this question (https://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome), it seems that the dependent variable, y, has to consist only of 0 and 1. Whenever I set one of the observations of y to 2, or to anything other than 0 or 1, I get this error.
This is my code:
lambdas_to_try <- 10^seq(-3, 5, length.out = 100)
x_vars <- as.matrix(data.frame(data$x1, data$x2, data$x3))
lasso_cv <- cv.glmnet(x_vars, y=as.factor(data$y), alpha = 1, lambda = lambdas_to_try, family = "binomial", nfolds = 10)
x_vars_2 <- model.matrix(data$y ~ data$x1 + data$x2 + data$x3)[, -1]
lasso_cv_2 <- cv.glmnet(x_vars_2, y=as.factor(data$y), alpha = 1, lambda = lambdas_to_try, family = "binomial", nfolds = 10)
And this is what my dataset looks like:
The problem is that in my data the y variable represents the number of crimes, so it takes integer values between 0 and 1000. I cannot restrict it to 0 and 1 only. How can I fit a Lasso regression to data like this?
Upvotes: 1
Views: 1457
Reputation: 46968
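As @Gregor noted, what you have is count data, so this is a regression problem and not a classification one. With family = "binomial" and as.factor(y), glmnet treats every distinct count as a separate class, and the error appears as soon as some value occurs only once. Using an example dataset, this is how you can implement it: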
library(MASS)
library(glmnet)
data(Insurance)
Your response variable should be numeric:
str(Insurance)
'data.frame': 64 obs. of 5 variables:
$ District: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ Group : Ord.factor w/ 4 levels "<1l"<"1-1.5l"<..: 1 1 1 1 2 2 2 2 3 3 ...
$ Age : Ord.factor w/ 4 levels "<25"<"25-29"<..: 1 2 3 4 1 2 3 4 1 2 ...
$ Holders : int 197 264 246 1680 284 536 696 3582 133 286 ...
$ Claims : int 38 35 20 156 63 84 89 400 19 52 ...
Now we set the predictors and response variables:
y = Insurance$Claims
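# drop the intercept column, since glmnet fits its own intercept
X = model.matrix(Claims ~ .,data=Insurance)[, -1]
Run cross-validation to find the best lambda (if you do not already have a value for the L1 penalty in mind):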
fit = cv.glmnet(x=X,y=y,family="poisson")
pred = predict(fit,X,s=fit$lambda.1se)
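The cross-validated object stores two candidate penalties, `lambda.min` and `lambda.1se`, and `coef()` shows which predictors survive at either one. A minimal sketch of inspecting them on the same Insurance data follows; the `set.seed(42)` call is an addition for reproducibility (the folds are assigned at random) and is not part of the code above.

```r
library(MASS)
library(glmnet)

data(Insurance)
set.seed(42)  # fold assignment in cv.glmnet is random, so fix the seed

y <- Insurance$Claims
X <- model.matrix(Claims ~ ., data = Insurance)[, -1]  # drop the intercept column
fit <- cv.glmnet(x = X, y = y, family = "poisson")

# Two candidate penalties stored on the cv object
fit$lambda.min  # lambda with the lowest cross-validated deviance
fit$lambda.1se  # largest lambda within one standard error of that minimum

# Coefficients at the more conservative choice;
# entries printed as "." were shrunk exactly to zero
coef(fit, s = "lambda.1se")
```

Using `lambda.1se` generally gives a sparser model than `lambda.min`, at the cost of a slightly higher cross-validated deviance.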
The prediction is on the log scale (predict() returns the linear predictor by default), so to compare it with your actual values, plot it against log(y):
plot(log(y),pred,xlab="log (actual)",ylab="log (predicted)")
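To compare on the original count scale instead, `predict()` accepts `type = "response"`, which returns the expected counts rather than their logarithm. The same pattern carries over to data shaped like the question's. Below is a minimal sketch on simulated data; the data frame `crime`, its columns `x1`, `x2`, `x3` and `y`, and the coefficients used to simulate the counts are hypothetical stand-ins, not the asker's actual data.

```r
library(glmnet)

set.seed(1)
# Hypothetical stand-in for the asker's data: three predictors and a count response
n <- 200
crime <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
crime$y <- rpois(n, lambda = exp(1 + 0.5 * crime$x1 - 0.3 * crime$x2))

# glmnet wants a numeric predictor matrix; drop the intercept column
x_vars <- model.matrix(y ~ x1 + x2 + x3, data = crime)[, -1]

# The response stays numeric (not a factor) and the family is "poisson"
lasso_cv <- cv.glmnet(x_vars, y = crime$y, alpha = 1,
                      family = "poisson", nfolds = 10)

# type = "link" (the default) is the log scale; type = "response" is the count scale
pred_link  <- predict(lasso_cv, newx = x_vars, s = "lambda.1se", type = "link")
pred_count <- predict(lasso_cv, newx = x_vars, s = "lambda.1se", type = "response")

plot(crime$y, pred_count, xlab = "actual count", ylab = "predicted count")
```

Plotting on the count scale also avoids taking log(0) when some observations have zero counts.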
Upvotes: 1