Colin

Reputation: 303

k-nearest neighbors where # of objects in each class differs vastly

I am running knn (in R) on a dataset where objects are classified A or B. However, there are many more A's than B's (18 of class A for every 1 of class B).

How should I combat this? If I use a k of 18, for example, and there are 7 B's among the neighbors (far more than the roughly one B you would expect in a group of 18), the test point will still be classified as A when it should probably be B.

I am thinking that a lower k will help me. Is there any rule of thumb for choosing the value of k, as it relates to the frequencies of the classes in the train set?

Upvotes: 0

Views: 47

Answers (1)

Guy haimovitz

Reputation: 1625

There is no such rule. For your case I would try a very small k, probably between 3 and 6.

As for the dataset: unless your test data or real-world data occur in roughly the same ratio you mentioned (18:1), I would remove some A's to get more accurate results. I would not advise doing this if the ratio is indeed close to the real-world ratio, because you would lose the effect of the ratio (a rarer class should be predicted less often).
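To make the "remove some A's" idea concrete, here is a minimal sketch of downsampling the majority class before a plain majority-vote kNN. It is written in Python with only the standard library rather than R, and everything in it is an assumption for illustration: the function names `downsample_majority` and `knn_predict`, the `ratio` parameter, and the toy data are all hypothetical and not from the question. As noted above, it only makes sense if the 18:1 ratio does not reflect the data you will actually classify.

```python
import math
import random
from collections import Counter


def downsample_majority(points, labels, majority="A", ratio=1.0, seed=0):
    """Keep every non-majority point plus a random subset of majority points.

    `ratio` (hypothetical parameter) is the number of majority points kept
    per non-majority point; 1.0 yields balanced classes.
    """
    rng = random.Random(seed)
    major = [i for i, y in enumerate(labels) if y == majority]
    minor = [i for i, y in enumerate(labels) if y != majority]
    keep_n = min(len(major), int(round(ratio * len(minor))))
    kept = sorted(minor + rng.sample(major, keep_n))
    return [points[i] for i in kept], [labels[i] for i in kept]


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data with the 18:1 imbalance from the question (made up for this demo).
    rng = random.Random(42)
    xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(180)]
    xs += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(10)]
    ys = ["A"] * 180 + ["B"] * 10

    bx, by = downsample_majority(xs, ys, majority="A", ratio=1.0)
    print("before:", Counter(ys), "after:", Counter(by))
    print("prediction near the B cluster:", knn_predict(bx, by, (2.0, 2.0), k=3))
```

Setting `ratio` above 1.0 keeps more A's, which is a middle ground if the real-world ratio is skewed but less extreme than 18:1.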

Upvotes: 1
