Desta Haileselassie Hagos

Reputation: 26096

Lasso: Cross-validation for glmnet

I am using cv.glmnet() to perform cross-validation, which is 10-fold by default.

library(Matrix)
library(tm)
library(glmnet)
library(e1071)
library(SparseM)
library(ggplot2)

trainingData <- read.csv("train.csv", stringsAsFactors=FALSE,sep=",", header = FALSE)
testingData  <- read.csv("test.csv",sep=",", stringsAsFactors=FALSE, header = FALSE)

x = model.matrix(as.factor(V42)~.-1, data = trainingData)
crossVal <- cv.glmnet(x=x, y=trainingData$V42, family="multinomial", alpha=1)
plot(crossVal)

I am getting the following error message:

Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  : 
  one multinomial or binomial class has 1 or 0 observations; not allowed

But as shown below, none of my response classes appears to have only 0 or 1 observations.

> table(trainingData$V42)

       back buffer_overflow       ftp_write    guess_passwd            imap         ipsweep            land      loadmodule        multihop 
        956              30               8              53              11            3599              18               9               7 
    neptune            nmap          normal            perl             phf             pod       portsweep         rootkit           satan 
      41214            1493           67343               3               4             201            2931              10            3633 
      smurf             spy        teardrop     warezclient     warezmaster 
       2646               2             892             890              20 

Any pointers?

Upvotes: 0

Views: 3993

Answers (1)

Hong Ooi

Reputation: 57686

cv.glmnet does N-fold cross-validation with N = 10 by default. This means it splits your data into 10 subsets, trains a model on 9 of them, and tests it on the remaining one, repeating so that each subset is left out in turn.
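One way to see the problem for yourself (a hedged sketch, assuming trainingData as loaded in the question): build a random 10-fold assignment like the one cv.glmnet uses internally and count the classes left in a single training fold. A class with only 2 or 3 observations can easily be absent from it.

    # Simulate a random 10-fold split and inspect one training fold
    set.seed(1)  # arbitrary seed, for reproducibility of the illustration
    foldid <- sample(rep(1:10, length.out = nrow(trainingData)))
    
    # Class counts in the training data when fold 1 is held out;
    # rare classes such as spy or perl may show 0 or 1 here
    table(trainingData$V42[foldid != 1])

cv.glmnet also accepts this vector directly via its foldid argument, so you could construct a stratified assignment yourself if you want to control which observations land in which fold.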

Some of your classes are rare enough that a training subset will sometimes run into the problem you encountered here (and in your previous question): after the split, a class can be left with 0 or 1 observations. The best solution is to reduce the number of classes in your response by combining the rarer ones (do you really need a predicted probability for spy or perl, for example?).
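A minimal sketch of combining the rarer classes, assuming trainingData as loaded in the question; the cutoff of 30 observations and the "other" label are arbitrary choices for illustration:

    # Lump any class with fewer than 30 observations into an "other" level
    y <- trainingData$V42
    rare <- names(which(table(y) < 30))   # hypothetical threshold
    y <- ifelse(y %in% rare, "other", y)
    
    # Rebuild the design matrix and cross-validate as before
    x <- model.matrix(as.factor(V42) ~ . - 1, data = trainingData)
    crossVal <- cv.glmnet(x = x, y = as.factor(y),
                          family = "multinomial", alpha = 1)

After collapsing, every remaining level has enough observations that a random 10-fold split is very unlikely to leave one of them empty.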

Also, if you're doing glmnet cross-validation and constructing a model matrix, you could use the glmnetUtils package I wrote to streamline the process.
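For instance, glmnetUtils provides a formula interface to cv.glmnet that builds the model matrix internally (a sketch, assuming the package is installed and trainingData is loaded as in the question):

    library(glmnetUtils)
    
    # Formula method: no manual model.matrix() call needed
    crossVal <- cv.glmnet(as.factor(V42) ~ ., data = trainingData,
                          family = "multinomial", alpha = 1)
    plot(crossVal)

Note this streamlines the model-matrix step but does not by itself fix the rare-class problem; you would still want to combine the rarer classes first.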

Upvotes: 3

Related Questions