Reputation: 26096
I am using cv.glmnet()
to perform cross-validation, by default 10-fold
library(Matrix)
library(tm)
library(glmnet)
library(e1071)
library(SparseM)
library(ggplot2)
trainingData <- read.csv("train.csv", stringsAsFactors=FALSE,sep=",", header = FALSE)
testingData <- read.csv("test.csv",sep=",", stringsAsFactors=FALSE, header = FALSE)
x = model.matrix(as.factor(V42)~.-1, data = trainingData)
crossVal <- cv.glmnet(x=x, y=trainingData$V42, family="multinomial", alpha=1)
plot(crossVal)
I am having the following error message
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
one multinomial or binomial class has 1 or 0 observations; not allowed
But as it is shown below, I don't seem to have an observation level with counts of either 0
or 1
.
>table(trainingData$V42)
back buffer_overflow ftp_write guess_passwd imap ipsweep land loadmodule multihop
956 30 8 53 11 3599 18 9 7
neptune nmap normal perl phf pod portsweep rootkit satan
41214 1493 67343 3 4 201 2931 10 3633
smurf spy teardrop warezclient warezmaster
2646 2 892 890 20
Any pointers?
Upvotes: 0
Views: 3993
Reputation: 57686
cv.glmnet
does N-fold crossvalidation with N=10 by default. This means it splits your data into 10 subsets, then trains a model on 9 of the 10 and tests it on the remaining 1. It repeats this, leaving out each subset in turn.
Your data is sparse enough that sometimes, the training subset will run into the problem encountered here (and in your previous question). The best solution is to reduce the number of classes in your response by combining the rarer classes (do you really need to get a predicted probability for spy
or perl
for example).
Also, if you're doing glmnet crossvalidation and constructing a model matrix, you could use the glmnetUtils package I wrote to streamline the process.
Upvotes: 3