Reputation: 12276
As you probably know, in K-NN the decision is usually made by "majority vote" and not according to some threshold - i.e. there is no parameter on which to base a ROC curve.
Note that in my K-NN implementation the votes don't have equal weights: the weight of each "neighbor" is e^(-d), where d is the distance between the tested sample and that neighbor. This gives higher weight to the votes of the nearer neighbors among the K neighbors.
My current decision rule is: if the sum of the scores of the positive neighbors is higher than the sum of the scores of the negative neighbors, the classifier says POSITIVE; otherwise it says NEGATIVE.
But - There is no threshold.
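For concreteness, here is a rough sketch of that rule in Python (the names knn_decide, train_X, train_y etc. are only illustrative, and labels are assumed to be 1/0):

```python
import numpy as np

def knn_decide(test_x, train_X, train_y, k=5):
    """Current decision rule: each of the k nearest neighbors votes with
    weight e^(-d); the class with the larger summed weight wins."""
    dists = np.linalg.norm(train_X - test_x, axis=1)   # distance to every training sample
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    weights = np.exp(-dists[nearest])                  # e^(-d) vote weights
    pos_score = weights[train_y[nearest] == 1].sum()   # summed weights of positive neighbors
    neg_score = weights[train_y[nearest] == 0].sum()   # summed weights of negative neighbors
    return "POSITIVE" if pos_score > neg_score else "NEGATIVE"
```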
Then, I thought about the following idea:
Deciding on the class with the higher sum of votes can be described more generally as applying a threshold of 0 to the score (POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES).
So I thought of changing my decision rule to apply a threshold to that measure, and of plotting a ROC curve by sweeping the threshold over the values of
(POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES)
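A minimal sketch of what I have in mind, reusing the weighting above (train_X, train_y, test_X, test_y are assumed to be NumPy arrays with 1/0 labels; knn_score is just an illustrative name):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def knn_score(test_x, train_X, train_y, k=5):
    """POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES for one sample."""
    dists = np.linalg.norm(train_X - test_x, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest])
    return weights[train_y[nearest] == 1].sum() - weights[train_y[nearest] == 0].sum()

# Score every test sample; a threshold of 0 reproduces the original decision rule.
scores = np.array([knn_score(x, train_X, train_y) for x in test_X])
fpr, tpr, thresholds = roc_curve(test_y, scores)   # sweeps the threshold over all score values
print("AUC:", auc(fpr, tpr))
```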
Does it sound like a good approach for this task?
Upvotes: 1
Views: 3760
Reputation: 66835
Yes, that is more or less what is typically done. If you take a look at scikit-learn, its k-NN implementation supports weights, and it also has predict_proba, which gives you a clear decision threshold. Typically, however, you do not want to threshold the difference but rather the ratio
votes_positive / (votes_positive + votes_negative) >= T  ->  classify as POSITIVE
This way you know that you just have to "move" the threshold from 0 to 1, not over arbitrary values. It also has a clear interpretation: an internal probability estimate that you consider "sure enough". By default T = 0.5, i.e. if the probability is above 50% you classify as positive, but as said before, you can do anything with it.
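For example, a minimal sketch with scikit-learn (train_X, train_y, test_X, test_y are assumed to exist; the callable weight mimics the e^(-d) weighting from the question, whereas the built-in weights="distance" option uses 1/d):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve

# Distance-weighted k-NN; the callable reproduces the e^(-d) weighting from the question.
clf = KNeighborsClassifier(n_neighbors=5, weights=lambda d: np.exp(-d))
clf.fit(train_X, train_y)

# predict_proba returns the normalized weighted votes, i.e. the ratio above,
# so the scores already lie in [0, 1] and the threshold T can be swept from 0 to 1.
proba_pos = clf.predict_proba(test_X)[:, 1]
fpr, tpr, thresholds = roc_curve(test_y, proba_pos)
```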
Upvotes: 2