RedMassiveStar

Reputation: 179

KNN Algorithm in Weka Never Completing On Large Dataset

I'm back with a question on data mining with Weka and WekaSharp. Through WekaSharp I have been analyzing a fairly large dataset, the KDD Cup 1999 10% database (~70 MB). I have had good results with the J48 decision tree and Naive Bayes algorithms, each taking between 10 and 30 minutes to complete. When I run the same data through the KNN algorithm, however, the analysis never finishes: it does not error out, it simply runs forever. I have tried many different parameter settings with no effect. When I run the same KNN algorithm on a smaller sample dataset such as iris.arff, it finishes with no difficulty.

Here is my setup for the KNN parameters:

"-K 1 -W 0 -A \"weka.core.neighboursearch.KDTree -A \\"weka.core.EuclideanDistance -R first-last\\"\""

Is there an inherent issue with KNN and large datasets, or is this a setup issue? Thank you very much.

Upvotes: 2

Views: 887

Answers (1)

Sneftel

Reputation: 41474

kNN is subject to the "curse of dimensionality": spatial queries on high-dimensional datasets cannot be optimized as effectively as queries on lower-dimensional ones, so an index structure such as the KDTree in your configuration degrades into what is effectively a brute-force search.
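To make the effect concrete, here is a minimal Java sketch. It does not use Weka at all: it draws uniform random points in a unit hypercube (an assumption chosen for simplicity, not a model of the KDD Cup data) and compares the nearest and farthest distance from a random query point. The class name, the method names, and the dimensions tried are all made up for illustration; 41 is included only because, as far as I recall, that is the attribute count of the KDD Cup 1999 data. As the dimension grows, the ratio drifts toward 1, meaning every point is about as far away as every other, which leaves a space-partitioning tree very little to prune.

```java
import java.util.Random;

public class DistanceConcentration {

    /** Euclidean distance between two points of equal dimension. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** One point drawn uniformly from the unit hypercube [0,1)^dim. */
    static double[] randomPoint(Random rng, int dim) {
        double[] p = new double[dim];
        for (int i = 0; i < dim; i++) {
            p[i] = rng.nextDouble();
        }
        return p;
    }

    /**
     * Draws one query point and n data points, then returns
     * (distance to nearest point) / (distance to farthest point).
     * A value close to 1 means the distances have "concentrated".
     */
    static double nearestToFarthestRatio(int dim, int n, long seed) {
        Random rng = new Random(seed);
        double[] query = randomPoint(rng, dim);
        double nearest = Double.POSITIVE_INFINITY;
        double farthest = 0.0;
        for (int i = 0; i < n; i++) {
            double d = euclidean(query, randomPoint(rng, dim));
            nearest = Math.min(nearest, d);
            farthest = Math.max(farthest, d);
        }
        return nearest / farthest;
    }

    public static void main(String[] args) {
        // Hypothetical dimensions, chosen only to show the trend.
        for (int dim : new int[] {2, 10, 41, 200}) {
            System.out.printf("dim=%d  nearest/farthest=%.3f%n",
                    dim, nearestToFarthestRatio(dim, 2000, 42L));
        }
    }
}
```

Real data is rarely uniform, so the numbers for an actual dataset will differ, but the trend is the reason tree-based neighbour search stops helping as attributes are added.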

NB laughs at dimensionality because it basically treats each attribute independently. Many decision tree variants also cope fairly well with high-dimensional data. kNN does not: expect to wait a long time.
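For a rough sense of "a long time", the sketch below counts the distance computations a pure brute-force kNN would need. Every input is an assumption on my part, not something stated in the question: roughly 494,021 instances in the 10% file, evaluation by 10-fold cross-validation, and no help at all from the search structure. The class and method names are invented for the example.

```java
public class KnnCostEstimate {

    /**
     * Number of pairwise distance computations for brute-force kNN under
     * k-fold cross-validation: in each fold, every test instance is
     * compared against every training instance.
     */
    static long distanceComputations(long instances, int folds) {
        long testPerFold = instances / folds;           // integer division, remainder ignored
        long trainPerFold = instances - testPerFold;
        return folds * testPerFold * trainPerFold;
    }

    public static void main(String[] args) {
        long instances = 494_021L; // assumed size of the KDD Cup 1999 10% file
        int folds = 10;            // assumed evaluation setup
        long total = distanceComputations(instances, folds);
        System.out.println("distance computations: " + total);
    }
}
```

Under those assumptions the count is on the order of 2 x 10^11 distance computations, each of which touches every attribute. J48 and Naive Bayes do nothing comparable at prediction time, which is consistent with them finishing in minutes while kNN appears to hang.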

Upvotes: 2
