Reputation: 5792
I'm trying to understand and plot TPR/FPR for different types of classifiers. I'm using kNN, NaiveBayes and Decision Trees in R. With kNN I'm doing the following:
library(class)  # knn()
library(ROCR)   # prediction(), performance()

clnum <- as.vector(diabetes.trainingLabels[,1], mode = "numeric")
dpknn <- knn(train = diabetes.training, test = diabetes.testing, cl = clnum, k = 11, prob = TRUE)
prob <- attr(dpknn, "prob")
tstnum <- as.vector(diabetes.testingLabels[,1], mode = "numeric")
pred_knn <- prediction(prob, tstnum)
pred_knn <- performance(pred_knn, "tpr", "fpr")
plot(pred_knn, avg = "threshold", colorize = TRUE, lwd = 3, main = "ROC curve for kNN, k = 11")
where diabetes.trainingLabels[,1] is the vector of class labels I want to predict, diabetes.training is the training data, and diabetes.testing is the testing data.
The plot looks like the following:
The values stored in the prob attribute form a numeric vector (decimals between 0 and 1). I convert the class-label factor into numbers, and then I can use it with the prediction/performance functions from the ROCR library. I'm not 100% sure I'm doing it correctly, but at least it works.
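One caveat worth noting, as a hedged sketch based on the documented behaviour of knn() from the class package: the prob attribute holds the vote fraction for the *winning* class of each test point, not for one fixed class, so it may need to be flipped before being passed to ROCR (here assuming the positive class is coded as 1 in clnum):

```r
library(class)
library(ROCR)

# attr(dpknn, "prob") is the fraction of the k neighbours that voted for the
# *predicted* class of each test point. To get the probability of the positive
# class (assumed to be coded as "1"), flip the value wherever the prediction
# was the other class.
dpknn <- knn(train = diabetes.training, test = diabetes.testing,
             cl = clnum, k = 11, prob = TRUE)
prob_win <- attr(dpknn, "prob")
prob_pos <- ifelse(dpknn == "1", prob_win, 1 - prob_win)

pred_knn <- prediction(prob_pos, tstnum)
perf_knn <- performance(pred_knn, "tpr", "fpr")
```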
For Naive Bayes and decision trees, though, with the prob/raw type specified in the predict function, I don't get a single numeric vector but a matrix in which a probability is given for each class, e.g.:
library(e1071)  # naiveBayes()

diabetes.model <- naiveBayes(class ~ ., data = diabetesTrainset)
diabetes.predicted <- predict(diabetes.model, diabetesTestset, type = "raw")
and diabetes.predicted is:
tested_negative tested_positive
[1,] 5.787252e-03 0.9942127
[2,] 8.433584e-01 0.1566416
[3,] 7.880800e-09 1.0000000
[4,] 7.568920e-01 0.2431080
[5,] 4.663958e-01 0.5336042
The question is: how do I use this to plot the ROC curve, and why do I get a single vector from kNN but separate probabilities for both classes from the other classifiers?
Upvotes: 21
Views: 2127
Reputation: 16
It looks like you are doing something fundamentally wrong.
Ideally, a kNN ROC curve looks like the one above. Here are a few calls you can use (note these are scikit-learn calls, not R):
print(model_name.predict(test))
print(model_name.kneighbors(test)[1])
Upvotes: 0
Reputation: 9656
The ROC curve you provided for the knn11 classifier looks off - it is below the diagonal, indicating that your classifier assigns class labels correctly less than 50% of the time. Most likely you provided wrong class labels or wrong probabilities. If in training you used class labels of 0 and 1, those same class labels should be passed to the ROC curve in the same order (without flipping 0 and 1).
Another less likely possibility is that you have a very weird dataset.
The ROC curve was originally developed for calling events from radar. Technically it is closely tied to predicting an event - the probability that you correctly call the event of a plane approaching on the radar. So it uses a single probability. This can be confusing when someone does classification on two classes where the "hit" probabilities are not evident, as in your case with cases and controls.
However, any two-class classification can be framed in terms of "hits" and "misses" - you just have to select the class you will call an "event". In your case, having diabetes might be called the event.
So from this table:
     tested_negative tested_positive
[1,]    5.787252e-03       0.9942127
[2,]    8.433584e-01       0.1566416
[3,]    7.880800e-09       1.0000000
[4,]    7.568920e-01       0.2431080
[5,]    4.663958e-01       0.5336042
you would only have to select one probability - that of the event, here "tested_positive". The other one, "tested_negative", is just 1 - tested_positive, because when the classifier "thinks" that a particular person has diabetes with 79% probability, it at the same time "thinks" there is a 21% probability of that person not having diabetes. You only need one number to express this idea, so kNN returns only one, while other classifiers can return two.
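As a hedged sketch of that idea (assuming the diabetes.predicted matrix from the question and a 0/1 numeric label vector for the test set, with 1 meaning tested_positive, like the tstnum vector in the kNN example), you could feed only the event column to ROCR:

```r
library(ROCR)

# Keep only the probability of the "event" class from the Naive Bayes output;
# the other column is redundant (it is 1 minus this one).
prob_pos <- diabetes.predicted[, "tested_positive"]

# tstnum: 0/1 numeric labels for the test set, 1 = tested_positive
# (hypothetical here - build it the same way as in the kNN example).
pred_nb <- prediction(prob_pos, tstnum)
perf_nb <- performance(pred_nb, "tpr", "fpr")
plot(perf_nb, colorize = TRUE, lwd = 3, main = "ROC curve for Naive Bayes")
```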
I don't know which library you used for decision trees so cannot help with the output of that classifier.
Upvotes: 0