Reputation: 2894
What I currently have:
I have a data frame with one factor column called "Class" that has 160 levels. There are 1200 predictor variables, all integers, with no individual cell exceeding 1000 (if that helps). About a quarter of the cells are zero. The full dataset contains 60,000 rows. I have already used the nearZeroVar and findCorrelation functions to get down to this number of variables. In my dataset, some variables may look unimportant on their own but are likely to be predictive when combined with two other variables.
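For reference, the pruning I have already done looks roughly like this (a sketch from memory; `df` is a placeholder name for my data frame, and the 0.9 cutoff is just the value I happened to use):

```r
library(caret)

# Separate the predictors from the "Class" factor column
predictors <- df[, setdiff(names(df), "Class")]

# Drop near-zero-variance predictors (nearZeroVar returns column indices)
nzv <- nearZeroVar(predictors)
if (length(nzv) > 0) predictors <- predictors[, -nzv]

# Drop one of each pair of highly correlated predictors
highCor <- findCorrelation(cor(predictors), cutoff = 0.9)
if (length(highCor) > 0) predictors <- predictors[, -highCor]
```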
What I have tried:
First I tried creating a random forest model, planning to use the varImp function to filter out the useless variables, but gave up after letting it run for days. Then I tried fscaret, but it ran overnight on an 8-core machine with 64 GB of RAM (the same machine as the previous attempt) and didn't finish. Then I tried feature selection using genetic algorithms; that also ran overnight without finishing. I have been trying to make principal component analysis work, but for some reason couldn't. I have never been able to successfully do PCA within caret, which could be both my problem and my solution here. I can follow all the "toy" demo examples on the web, but I still think I am missing something in my case.
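For what it's worth, the abandoned random forest attempt looked roughly like this (again a sketch from memory; `trainData` is a placeholder for the 60,000-row data frame with the "Class" column):

```r
library(caret)

# This is the run I killed after several days
rfModel <- train(Class ~ ., data = trainData, method = "rf",
                 trControl = trainControl(method = "cv", number = 5))

# The plan was to rank predictors by importance and drop the rest
varImp(rfModel)
```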
What I need:
I need some way to quickly reduce the dimensionality of my dataset so I can make it usable for building a model. Maybe a good place to start would be an example of applying PCA to a dataset like mine in caret. Of course, I'm happy to hear any other ideas that might get me out of the quicksand I am in right now.
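To make the question concrete, this is roughly the kind of caret PCA recipe I have been trying to get working (`predictors` and `df` as above; `thresh = 0.95` is an arbitrary choice on my part):

```r
library(caret)

# Fit the PCA transform on the predictors (centering and scaling first);
# thresh keeps enough components to explain 95% of the variance
pp <- preProcess(predictors, method = c("center", "scale", "pca"),
                 thresh = 0.95)

# Apply it to get PC1, PC2, ... columns, then put the class back on
reduced <- predict(pp, predictors)
trainReduced <- cbind(Class = df$Class, reduced)
```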
Upvotes: 0
Views: 866
Reputation: 9313
I have only done some toy examples myself.
Still, here are some ideas that do not fit into a comment.
All your attributes seem to be numeric. Maybe running the Naive Bayes algorithm on your dataset would give some reasonable classifications? It assumes all attributes are independent of each other, but experience shows (and many scholars say) that Naive Bayes results are often still useful despite that strong assumption.
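A minimal sketch of what I mean, using the naiveBayes function from the e1071 package (I'm assuming your data frame is called `df` with the "Class" column as in your question; for numeric attributes it fits one Gaussian per class):

```r
library(e1071)

# Fit a naive Bayes classifier; numeric attributes are modelled as
# class-conditional Gaussians
nb <- naiveBayes(Class ~ ., data = df)

# Quick sanity check on the training data (a proper held-out test
# split would of course be better)
pred <- predict(nb, newdata = df)
mean(pred == df$Class)
```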
If you absolutely MUST do attribute selection, e.g. as part of an assignment:
Did you try processing your dataset with the free GUI-based data-mining tool Weka? It has an "attribute selection" tab where several algorithms (and combinations of algorithms) for removing irrelevant attributes are at your disposal. That is an art, though, and the results are not always easy to interpret.
Read this pdf as an introduction and see this video for a walk-through and an introduction to the theoretical approach.
The video assumes some familiarity with Weka, but it may still help. There is an RWeka interface, but it is a bit laborious to install, so working with the Weka GUI might be easier.
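If you do go the RWeka route, here is a small sketch of what attribute selection from R could look like, using Weka's information-gain evaluator (again assuming `df` and "Class" as in your question; keeping 100 attributes is an arbitrary choice):

```r
library(RWeka)  # needs a working Java installation

# Score every attribute by its information gain with respect to Class
scores <- InfoGainAttributeEval(Class ~ ., data = df)

# Keep, say, the 100 highest-scoring attributes
keep <- names(sort(scores, decreasing = TRUE))[1:100]
dfReduced <- df[, c("Class", keep)]
```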
Upvotes: 1