Reputation: 1215
I am trying to plot a ROC curve from the class probabilities predicted by a classification tree, and then compute the AUC from it. However, when I plot the curve, nothing appears. Does anyone know how to fix this? The binary column Risk flags misclassification (1 when the predicted class matches the actual class, 0 otherwise), which I presume is my label. Should I be applying the ROC calculation at a different point in my code?
Here is the data frame:
library(ROCR)
data(Risk.table)
pred = prediction(Risk.table$Predicted.prob, Risk.table2$Risk)
perf = performance(pred, measure="tpr", x.measure="fpr")
perf
plot(perf)
Predicted.prob Actual.prob predicted actual Risk
1 0.5384615 0.4615385 G8 V4 0
2 0.1212121 0.8787879 V4 V4 1
3 0.5384615 0.4615385 G8 G8 1
4 0.9000000 0.1000000 G8 G8 1
5 0.1212121 0.8787879 V4 V4 1
6 0.1212121 0.8787879 V4 V4 1
7 0.9000000 0.1000000 G8 G8 1
8 0.5384615 0.4615385 G8 V4 0
9 0.5384615 0.4615385 G8 V4 0
10 0.1212121 0.8787879 V4 G8 0
11 0.1212121 0.8787879 V4 V4 1
12 0.9000000 0.1000000 G8 V4 0
13 0.9000000 0.1000000 G8 V4 0
14 0.1212121 0.8787879 G8 V4 1
15 0.9000000 0.1000000 G8 G8 1
16 0.5384615 0.4615385 G8 V4 0
17 0.9000000 0.1000000 G8 V4 0
18 0.1212121 0.8787879 V4 V4 1
19 0.5384615 0.4615385 G8 V4 0
20 0.1212121 0.8787879 V4 V4 1
21 0.9000000 0.1000000 G8 G8 1
22 0.5384615 0.4615385 G8 V4 0
23 0.9000000 0.1000000 G8 V4 0
24 0.1212121 0.8787879 V4 V4 1
#Split data 70:30 after shuffling the data frame
index<-1:nrow(LDA.scores1)
trainindex.LDA3=sample(index, trunc(length(index)*0.70),replace=FALSE)
LDA.70.trainset3<-shuffle.cross.validation2[trainindex.LDA3,]
LDA.30.testset3<-shuffle.cross.validation2[-trainindex.LDA3,]
tree.split3<-rpart(Family~., data=LDA.70.trainset3, method="class")
tree.split3
summary(tree.split3)
print(tree.split3)
plot(tree.split3)
text(tree.split3,use.n=T,digits=0)
printcp(tree.split3)
tree.split3
res3=predict(tree.split3,newdata=LDA.30.testset3)
res4=as.data.frame(res3)
res4$predicted<-NA
res4$actual<-NA
for (i in 1:length(res4$G8)){
if(res4$G8[i]>res4$V4[i]) {
res4$predicted[i]<-"G8"
}
else {
res4$predicted[i]<-"V4"
}
print(i)
}
res4
res4$actual<-LDA.30.testset3$Family
res4
Risk.table$Risk<-NA
Risk.table
for (i in 1:length(Risk.table$Risk)){
if(Risk.table$predicted[i]==res4$actual[i]) {
Risk.table$Risk[i]<-1
}
else {
Risk.table$Risk[i]<-0
}
print(i)
}
#Confusion Matrix
cm=table(res4$actual, res4$predicted)
names(dimnames(cm))=c("actual", "predicted")
index<-1:nrow(significant.lda.Wilks2)
trainindex.LDA.help1=sample(index, trunc(length(index)*0.70), replace=FALSE)
sig.train=significant.lda.Wilks2[trainindex.LDA.help1,]
sig.test=significant.lda.Wilks2[-trainindex.LDA.help1,]
library(klaR)
nbmodel<-NaiveBayes(Family~., data=sig.train)
prediction<-predict(nbmodel, sig.test)
NB<-as.data.frame(prediction)
colnames(NB)<-c("Actual", "Predicted.prob", "actual.prob")
NB$actual2 = NA
NB$actual2[NB$Actual=="G8"] = 1
NB$actual2[NB$Actual=="V4"] = 0
NB2<-as.data.frame(NB)
plot(fit.perf, col="red"); #Naive Bayes
plot(perf, col="blue", add=T); #Classification Tree
abline(0,1,col="green")
library(caret)
library(e1071)
train_control<-trainControl(method="repeatedcv", number=10, repeats=3)
model<-train(Matriline~., data=LDA.scores, trControl=train_control, method="nb")
predictions <- predict(model, LDA.scores[,2:13])
confusionMatrix(predictions,LDA.scores$Family)
Confusion Matrix and Statistics
          Reference
Prediction V4 G8
        V4 25  2
        G8  5 48
Accuracy : 0.9125
95% CI : (0.828, 0.9641)
No Information Rate : 0.625
P-Value [Acc > NIR] : 4.918e-09
Kappa : 0.8095
Mcnemar's Test P-Value : 0.4497
Sensitivity : 0.8333
Specificity : 0.9600
Pos Pred Value : 0.9259
Neg Pred Value : 0.9057
Prevalence : 0.3750
Detection Rate : 0.3125
Detection Prevalence : 0.3375
Balanced Accuracy : 0.8967
'Positive' Class : V4
Upvotes: 2
Views: 6184
Reputation: 16121
I have various things to point out:
1) I think the formula inside your rpart command has to be Family ~ .
2) In your initial table I can see a value W3 in your predicted column. Does that mean your dependent variable is not binary? ROC curves work with binary labels, so check it.
3) Your predicted and actual probabilities in your initial table always sum to 1. Is that reasonable? I think they represent something else, so you might consider changing names in case they confuse you in the future.
4) I think you’re confused about how ROC works and what inputs it needs. Your Risk column uses 1 to represent a correct prediction and 0 to represent a wrong prediction. However, the ROC curve needs 1 to represent one class and 0 to represent the other class. In simple words, the command is prediction(predictions, labels), where predictions are your predicted probabilities and labels are the true classes/levels of your dependent variable.
Check the following code:
dt = read.table(text="
Id Predicted.prob Actual.prob predicted actual Risk
1 0.5384615 0.4615385 G8 V4 0
2 0.1212121 0.8787879 V4 V4 1
3 0.5384615 0.4615385 G8 G8 1
4 0.9000000 0.1000000 G8 G8 1
5 0.1212121 0.8787879 V4 V4 1
6 0.1212121 0.8787879 V4 V4 1
7 0.9000000 0.1000000 G8 G8 1
8 0.5384615 0.4615385 G8 V4 0
9 0.5384615 0.4615385 G8 V4 0
10 0.1212121 0.8787879 V4 G8 0
11 0.1212121 0.8787879 V4 V4 1
12 0.9000000 0.1000000 G8 V4 0
13 0.9000000 0.1000000 G8 V4 0
14 0.1212121 0.8787879 W3 V4 1
15 0.9000000 0.1000000 G8 G8 1
16 0.5384615 0.4615385 G8 V4 0
17 0.9000000 0.1000000 G8 V4 0
18 0.1212121 0.8787879 V4 V4 1
19 0.5384615 0.4615385 G8 V4 0
20 0.1212121 0.8787879 V4 V4 1
21 0.9000000 0.1000000 G8 G8 1
22 0.5384615 0.4615385 G8 V4 0
23 0.9000000 0.1000000 G8 V4 0
24 0.1212121 0.8787879 V4 V4 1", header=T)
library(ROCR)
roc_pred <- prediction(dt$Predicted.prob, dt$Risk)
perf <- performance(roc_pred, "tpr", "fpr")
plot(perf, col="red")
abline(0,1,col="grey")
This gives the ROC curve with Risk as the label.
Compare it with what you get when you create a new column actual2, with 1 instead of G8 and 0 instead of V4:
dt$actual2 = NA
dt$actual2[dt$actual=="G8"] = 1
dt$actual2[dt$actual=="V4"] = 0
roc_pred <- prediction(dt$Predicted.prob, dt$actual2)
perf <- performance(roc_pred, "tpr", "fpr")
plot(perf, col="red")
abline(0,1,col="grey")
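Since the question also asks for the AUC, here is a minimal sketch of one way to get it. It assumes the same scores and class labels as dt$Predicted.prob and dt$actual above, retyped as plain vectors (scores, actual and labels are illustrative names). The base R part uses the rank (Mann-Whitney) formulation of the AUC. The ROCR call performance(pred, measure = "auc") is run only if ROCR is installed.

```r
# Scores and true classes, retyped from the table in row order (illustrative vectors).
scores <- c(0.5384615, 0.1212121, 0.5384615, 0.9, 0.1212121, 0.1212121,
            0.9, 0.5384615, 0.5384615, 0.1212121, 0.1212121, 0.9,
            0.9, 0.1212121, 0.9, 0.5384615, 0.9, 0.1212121,
            0.5384615, 0.1212121, 0.9, 0.5384615, 0.9, 0.1212121)
actual <- c("V4", "V4", "G8", "G8", "V4", "V4", "G8", "V4", "V4", "G8", "V4", "V4",
            "V4", "V4", "G8", "V4", "V4", "V4", "V4", "V4", "G8", "V4", "V4", "V4")

# Assumption: G8 is treated as the positive class (1), V4 as the negative class (0).
labels <- as.integer(actual == "G8")

# AUC via the rank (Mann-Whitney U) formula; tied scores receive average ranks.
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc_rank <- (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc_rank)

# The same quantity from ROCR, if the package is available.
if (requireNamespace("ROCR", quietly = TRUE)) {
  roc_pred <- ROCR::prediction(scores, labels)
  auc_rocr <- ROCR::performance(roc_pred, measure = "auc")@y.values[[1]]
  print(auc_rocr)
}
```

If the value comes out below 0.5, the scores probably refer to the other class. In that case, use the other probability column or swap the 0/1 coding.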
5) As @eipi10 mentioned above, you should try to get rid of the for loops in your code.
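As a rough illustration of that last point, the two for loops in the question can be replaced with vectorised comparisons. The tiny data frame below is made up for the example. Only the column names G8 and V4 and the meaning of the Risk column are taken from the question.

```r
# Toy class probabilities shaped like the output of predict() on an rpart
# classification tree (values are invented for illustration).
res <- data.frame(G8 = c(0.90, 0.12, 0.54), V4 = c(0.10, 0.88, 0.46))
actual <- c("G8", "V4", "V4")

# Predicted class: G8 where its probability is larger, otherwise V4.
res$predicted <- ifelse(res$G8 > res$V4, "G8", "V4")
res$actual <- actual

# 1 where the prediction is correct, 0 where it is wrong (the question's Risk column).
res$Risk <- as.integer(res$predicted == res$actual)
print(res)
```

Each comparison works on whole columns at once, so no index variable or print(i) is needed.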
Upvotes: 1