Pu Ke

Reputation: 1

How to do the CV test to examine the classification error of LDA in R

Please give me a simple example, I am stuck! I have tried the errorest function (from the ipred package) and followed the 10-fold CV example it gives for LDA. But when I use my own data, it just says the predict is not numeric, and I don't know why. Thank you! The R code is below. I want to do binary LDA, so I generate the data:

library(MASS)
n=500
#generate x1 and x2
Sigma=matrix(c(2,0,0,1),nrow=2,ncol=2)
#logistic model with parameters {1,4,-2}
beta.star=c(1,4,-2)
Xtilde=mvrnorm(n=n,mu=c(0.5,2),Sigma=Sigma)
X=cbind(1,Xtilde)
z=X%*%beta.star
#pass through an inv-logit function
pr=exp(z)/(1+exp(z))
#simulate binary response
#(the probability of response is a vector)
y=rbinom(n,1,pr)

Then I use LDA to fit the model:

library(MASS)
df.cv=data.frame(V1=Xtilde[,1],V2=Xtilde[,2])
exper1<-lda(y~V1+V2,data=df.cv)
plda<-predict(exper1,newdata=df.cv)

Finally, I want to run CV on the original data and see the error. I do this, which is wrong:

mypredict.lda <- function(object, newdata)
  predict(object, newdata = newdata)$class
errorest(y ~ ., data=data.frame(da), model=lda, estimator="cv", predict=as.numeric(mypredict.lda))

What should I do to get the error with CV?

Upvotes: 0

Views: 3495

Answers (1)

Tad Dallas

Reputation: 1189

So we start with all of your previous code setting up the fake data:

library(MASS)
n=500
#generate x1 and x2
Sigma=matrix(c(2,0,0,1),nrow=2,ncol=2)

#logistic model with parameters {1,4,-2}
beta.star=c(1,4,-2)
Xtilde=mvrnorm(n=n,mu=c(0.5,2),Sigma=Sigma)
X=cbind(1,Xtilde)
z=X%*%beta.star

#pass through an inv-logit function
pr=exp(z)/(1+exp(z))
#simulate binary response
y=rbinom(n,1,pr)

#Now we do the LDA
df.cv=data.frame(V1=Xtilde[,1],V2=Xtilde[,2])

Below, we divide the data into two parts, a training set and a test set. An 80/20 split (the 0.8 below) corresponds to one fold of five-fold cross-validation; holding out 10% instead (0.9) corresponds to one fold of 10-fold cross-validation. A full k-fold CV repeats this so that every observation is held out exactly once; a complete 10-fold loop is sketched after the next code block.

library(ROCR)
#randomly pick 80% of the rows for training; the rest form the test set
inds=sample(1:nrow(df.cv),0.8*nrow(df.cv))
df.train=df.cv[inds,]
df.test=df.cv[-inds,]
train.model = lda(y[inds] ~ V1+V2, data=df.train)
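
Since the question asks for the classification error under CV, here is a minimal sketch of a full 10-fold loop done by hand (df.all, folds, and cv.err are just names made up for this sketch; the response goes into the data frame as a factor):

#put predictors and response in one data frame, response as a factor
df.all=data.frame(V1=Xtilde[,1],V2=Xtilde[,2],y=factor(y))

k=10
#randomly assign each row to one of the k folds
folds=sample(rep(1:k,length.out=nrow(df.all)))
cv.err=numeric(k)
for (i in 1:k) {
  fit=lda(y~V1+V2,data=df.all[folds!=i,])
  pred=predict(fit,newdata=df.all[folds==i,])$class
  #misclassification rate on the held-out fold
  cv.err[i]=mean(pred!=df.all$y[folds==i])
}
mean(cv.err) #10-fold CV estimate of the classification error

Averaging the per-fold misclassification rates gives the usual 10-fold CV estimate of the error.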

From the trained model, we predict on the test set. Below, I determine the predicted values and then assess the accuracy of the predictions. Here I use a ROC curve, but you can use whatever metric you'd like; I wasn't sure exactly what you meant by error.

#hard class predictions on the held-out test set, coerced to numeric for ROCR
preds=as.numeric(predict(train.model, df.test)$class)
actual=y[-inds]
aucCurve=performance(prediction(preds,actual), "tpr", "fpr")
plot(aucCurve)
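
One caveat: because preds holds hard class labels, the ROC curve above has only a single interior point. For a smoother curve you can hand ROCR the LDA posterior probability of class 1 instead; probs and aucCurve.prob below are just names for this sketch:

#posterior probability of class 1 rather than the hard 0/1 label
probs=predict(train.model, df.test)$posterior[,2]
aucCurve.prob=performance(prediction(probs,actual), "tpr", "fpr")
plot(aucCurve.prob)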

The area under this ROC curve is a measure of predictive accuracy. Values closer to 1 mean you have good predictive capability.

auc=performance(prediction(preds,actual), "auc")
auc@y.values
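
Finally, on the errorest error in the question: as.numeric(mypredict.lda) tries to coerce the function itself to a number, which is most likely what R is complaining about. A sketch of a call I would expect to work, assuming the ipred package and reusing the df.all data frame (response stored as a factor) from the loop sketch above:

library(ipred)
#pass the prediction wrapper itself, not as.numeric(...) of it
mypredict.lda <- function(object, newdata)
  predict(object, newdata = newdata)$class
errorest(y~V1+V2, data=df.all, model=lda,
         estimator="cv", predict=mypredict.lda,
         est.para=control.errorest(k=10))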

Hopefully this helped, and isn't horribly incorrect. Other folks please chime in with corrections or clarifications.

Upvotes: 1
