Anwar Shaikh

Reputation: 1671

How to handle duplicate data points in k-Nearest Neighbor algorithm?

I have a large data set on which I am running the k-Nearest Neighbor classification algorithm. Consider a scenario with k=3: I have a new (unclassified) point 'x', and I find its 3 nearest neighbors n1, n2, n3.

The problem arises when n1, n2, n3 all have exactly the same features, i.e. they are duplicate data points. In my case this is a movie database where n1, n2, n3 are three customers who have watched exactly the same movies, the same number of times.

So do I have to consider them separately? Or should I consider them as one data point and look for 2 more unique data points?

Upvotes: 0

Views: 2996

Answers (2)

Has QUIT--Anony-Mousse

Reputation: 77474

Neither is more correct than the other.

Mathematically, it is common to assume that points with identical features are the same point. But the merged point may then carry multiple labels and weights, which makes it more expensive to handle.
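As a rough illustration of the "merge identical points" view, here is a minimal sketch in plain Python. The names `merge_duplicates` and `knn_unique` are made up for this example, and letting each merged point vote with its label counts is only one possible way to handle the "multiple labels and weights"; it assumes Euclidean distance and Python 3.8+ for `math.dist`.

```python
from collections import Counter, defaultdict
import math

def merge_duplicates(points, labels):
    """Collapse identical feature vectors into one entry holding label counts."""
    merged = defaultdict(Counter)
    for features, label in zip(points, labels):
        merged[tuple(features)][label] += 1
    return merged

def knn_unique(merged, query, k):
    """Vote among the k nearest *unique* points.

    Each unique point contributes all of its label counts, so duplicates
    act as a weight on that point rather than as extra neighbors.
    """
    nearest = sorted(merged, key=lambda p: math.dist(p, query))[:k]
    votes = Counter()
    for p in nearest:
        votes.update(merged[p])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: three identical customers plus two distinct ones.
points = [(1, 1), (1, 1), (1, 1), (2, 2), (3, 3)]
labels = ["A", "A", "A", "B", "B"]
merged = merge_duplicates(points, labels)
prediction = knn_unique(merged, (1.1, 1.1), k=3)
```

With k=3 the search here reaches past the duplicates to (2, 2) and (3, 3), which is the "look for 2 more unique data points" option from the question. The extra bookkeeping for label counts is the added cost mentioned above.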

Intuitively, and from a database point of view, the k nearest neighbors should be k objects, regardless of whether they are identical or not. It is a fact that there is more than one "President George Bush"; why merge them? If you wanted more objects, you should have chosen a larger k.
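The "k objects" view needs no special handling at all, which a small sketch can show. The function name `knn_objects` and the toy data are hypothetical, and Euclidean distance via `math.dist` (Python 3.8+) is an assumption made for illustration.

```python
from collections import Counter
import math

def knn_objects(points, labels, query, k):
    """Plain kNN: every row is its own object, duplicates included."""
    # Sort row indices by distance so identical rows stay separate neighbors.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: three identical customers with differing labels.
points = [(1, 1), (1, 1), (1, 1), (2, 2), (3, 3)]
labels = ["A", "A", "B", "B", "B"]
prediction = knn_objects(points, labels, (1.1, 1.1), k=3)
```

Here the three duplicates fill all k=3 slots, so the two more distant points never get a vote. Raising k is the way to bring them in.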

Choose whichever you prefer, but do not assume everybody made the same decision.

Upvotes: 1

Varun Sankar

Reputation: 23

It depends on what you're using it for.

If you're trying to see who has watched the same movies the same number of times, then you'd want to treat them as discrete points, because although they are duplicated points, they are still the nearest neighbors.

If you want to see an aggregate of how many times a movie has been watched, then duplicated points should be treated as one point.

Hope this helps, --Varun

Upvotes: 0
