k-nearest neighbors where # of objects in each class differs vastly

Question

I am running knn (in R) on a dataset where objects are classified A or B. However, there are many more A's than B's (18 of class A for every 1 of class B).

How should I combat this? If I use a k of 18, for example, and there are 7 B's in the neighbors (way more than the average B's in a group of 18), the test data will still be classified as A when it should probably be B.

I am thinking that a lower k will help me. Is there any rule of thumb for choosing the value of k, as it relates to the frequencies of the classes in the train set?

Guy haimovitz · Accepted Answer

There is no such rule. For your case I would try a very small k, probably between 3 and 6, so that a handful of nearby B's is not outvoted by more distant A's.
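To make the effect of k concrete, here is a minimal sketch of plain majority-vote kNN in Python, not the R `knn` call from the question. The toy data is hypothetical: 7 B's sit right next to the query point and 126 A's lie further out, giving the same 18:1 ratio. The function name `knn_predict` and the 1-D distance are assumptions made for illustration only.

```python
from collections import Counter

def knn_predict(train, labels, query, k):
    """Classify `query` by majority vote among its k nearest 1-D training points."""
    # Rank training indices by absolute distance to the query point.
    order = sorted(range(len(train)), key=lambda i: abs(train[i] - query))
    # Count the class labels of the k closest points and return the most common.
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 7 B's close to the query at 0.0, then 126 A's (18 per B).
b_points = [0.1 * i for i in range(1, 8)]        # roughly 0.1 .. 0.7
a_points = [0.8 + 0.1 * i for i in range(126)]   # 0.8 and outwards
train = b_points + a_points
labels = ["B"] * len(b_points) + ["A"] * len(a_points)

for k in (3, 5, 18):
    print(k, knn_predict(train, labels, 0.0, k))
```

With k = 18 the 7 B's are outvoted by the 11 nearest A's, which is the situation described in the question; with k = 3 or k = 5 only the nearby B's vote. On real data the best k is still something to check empirically, for example by cross-validation.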

As for the dataset: unless your test data or real-world data occur in about the same ratio you mentioned (18:1), I would remove some A's from the training set for more accurate results. I would not advise doing this if the ratio is indeed close to the real-world data, because you would lose the effect of the ratio (a rarer class should be predicted with lower probability).
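Removing some A's can be done by random undersampling. The following is only a sketch under stated assumptions: the helper name `undersample`, its `ratio` parameter, and the fixed seed are all made up for illustration, and the target ratio of 3:1 is arbitrary. The right ratio depends on how the classes occur in the data you expect to classify.

```python
import random

def undersample(rows, labels, majority, ratio, seed=0):
    """Keep all minority rows plus a random subset of majority rows,
    so that majority:minority is at most `ratio`:1."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    maj = [i for i, y in enumerate(labels) if y == majority]
    mino = [i for i, y in enumerate(labels) if y != majority]
    # Number of majority rows to keep, capped at what is available.
    n_keep = min(len(maj), int(ratio * len(mino)))
    keep = sorted(mino + rng.sample(maj, n_keep))
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Hypothetical training set with the 18:1 ratio from the question.
rows = list(range(190))
labels = ["A"] * 180 + ["B"] * 10

new_rows, new_labels = undersample(rows, labels, majority="A", ratio=3)
print(new_labels.count("A"), new_labels.count("B"))
```

Only the training set should be thinned this way; the test set should keep whatever ratio you expect in practice, otherwise the measured accuracy will be misleading.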
