Reputation: 187
I believe a lot of people have studied and are studying the k-NN algorithm, and I am one of them.
I have just come across a sentence that says: "For any given problem, a small value of k will lead to a large variance in predictions. Alternatively, setting k to a large value may lead to a large model bias." I think it is straightforward to many people, but not to me.
If you already know what it means, please help me understand what it is.
Upvotes: 2
Views: 5684
Reputation: 3194
OK, let's start from the top.
1. How does k-NN work?
You have a base of n points (with k much smaller than n) for which you know the desired answer - you've probably obtained it from an oracle. That set is called the training set, as you provide it to a virtual entity (the k-NN classifier) so it can learn the desired outcomes. By "point" we mean a single example, described by features in some space that allows us to calculate distance.
When asked to classify (recognize) a new point, you search your n points for the k instances closest to the new one. By "closest" we mean "with the shortest distance between feature vectors". Then you choose the answer based on the votes of those k points. For example, if k=5 and 3 points say the new one is of class A and 2 say class B, you assume the new one is of class A. You also have to specify some strategy for draws - for example, falling back to 1-NN and returning the class of the single closest point.
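If code reads better for you than prose, here is a minimal sketch of that procedure in Python. The function knn_classify and the choice of Euclidean distance are my own for illustration, not any particular library's API:

```python
from collections import Counter
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_points, train_labels, query, k):
    # Sort the training points by distance to the query point.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pl: euclidean(pl[0], query),
    )
    neighbour_labels = [label for _, label in by_distance[:k]]
    votes = Counter(neighbour_labels).most_common()
    # Draw between the top classes? Fall back to 1-NN:
    # return the class of the single closest point.
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return by_distance[0][1]
    return votes[0][0]

# Tiny usage example with two classes, 'A' and 'B'.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
labels = ['A', 'A', 'A', 'B', 'B']
print(knn_classify(points, labels, (4, 5), k=3))  # -> 'B'
```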
2. "For any given problem, a small value of k will lead to a large variance in predictions."
I'm assuming that by "large variance in predictions" the author meant "many errors when classifying a large amount of data".
Why is it so?
Because k-NN is very naive. It is intuitive that close points will probably be of the same class, but that isn't always so. For example, see point A in the picture below. If we use k=1, the closest point is red, even though the answer should be green. For k=2 we get a draw between red and green and choose red, because it is closer.
Source: English wiki, with slight by-hand modification
In the end, that sentence IMO means "if k is small, you'll probably get many wrong results".
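To make that concrete, here is a small sketch that continues from the code snippet above (the points and labels are made up): a single mislabelled point flips the answer for k=1, while k=5 simply outvotes it.

```python
# Continuing from the knn_classify() sketch above.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),  # green cluster
          (1.0, 1.0), (1.2, 1.1), (1.1, 0.9),              # red cluster
          (0.15, 0.15)]          # an outlier sitting inside the green cluster...
labels = ['green'] * 4 + ['red'] * 3 + ['red']  # ...but labelled red

query = (0.1, 0.1)               # clearly inside the green cluster
print(knn_classify(points, labels, query, k=1))  # 'red'   - follows the noisy point
print(knn_classify(points, labels, query, k=5))  # 'green' - the noise is outvoted
```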
3. "Setting k to a large value may lead to a large model bias."
A "bias" is a tendency to give one answer more often then the other even though the questions are evenly distributed. With large k that may occur, but the question is "when".
Well, the answer to "when" is "when your training set is biased". "biased" here means that some classes are represented by more points than others.
Consider a training set where you have 5 points for class + and far more points for class *, as in the picture below.
It may not represent the real relation between the classes, but that's all the data you've got. In the picture below the classes are probably linearly separable, and the point marked with a red ? is probably a +.
If you use k=10 you'll almost always get the answer *. The best-case scenario is the point marked with the red ? - you take all five + points and another 5 * points, then resolve the draw with 1-NN, which returns +, so the answer is right.
Anyway, in most cases your classifier will give one specific answer, and that is exactly bias - one class is returned more often than it should be.
That is not the case in the previous example, though - as the sentence states, a large k may lead to a large bias, but it doesn't have to.
In the end, that sentence means that if your dataset is biased, then your classifier is more likely to be biased too for a large k than for a small k.
Source: my own
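Here is a rough sketch of that situation, again continuing from the first code snippet (the cluster positions and counts are made up):

```python
# Continuing from the knn_classify() sketch above: 5 '+' points drowned
# in 20 '*' points. Near the '+' cluster, a large k still leans towards '*'.
import random

random.seed(0)
plus_points = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(5)]
star_points = [(random.uniform(3, 6), random.uniform(3, 6)) for _ in range(20)]
points = plus_points + star_points
labels = ['+'] * 5 + ['*'] * 20

query = (1.2, 1.2)  # right next to the '+' cluster
print(knn_classify(points, labels, query, k=3))   # '+' - only '+' neighbours
print(knn_classify(points, labels, query, k=10))  # '+' - 5:5 draw, resolved by 1-NN
print(knn_classify(points, labels, query, k=15))  # '*' - at most 5 votes can ever be '+'
```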
4. Summing up and further reading.
I hope that this clarifies some things for you.
If you need more, see this.
Upvotes: 2