user3203820

Reputation: 21

randomForest in R with a large number of features

I have about 10,000 samples and 9,000 features. I am trying to use RandomForest (RF or GRF) for feature (variable) selection/reduction.

This works well with 700 features, but with 9,000, randomForest or RRF runs for hours without finishing, even with a single tree and mtry=1. (FYI, I use sampsize=800.)

I was hoping to at least be able to run a single tree, so that I could then grow trees on multiple computers and combine the results.

Any ideas that could help?

Roni

Upvotes: 2

Views: 819

Answers (1)

Ricardo Cristian Ramirez

Reputation: 1234

I have been dealing with the same problem, and I solved it as follows:

  1. Divide your 9,000 features into, say, 9 groups of 1,000 features each, keeping all 10K samples in every group
  2. Run feature selection on each group and select, say, the 300 most informative features from each
  3. Combine the selected 9*300 = 2,700 features and repeat steps 1 and 2
  4. In the end you are left with 300 selected features

This approach may lose some important features, but it generally selects the most informative ones. You can change the number of features selected per group (300 in this example) to suit your needs.
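The steps above can be sketched in R roughly as follows. This is only a minimal sketch: it assumes the randomForest package and uses permutation importance as the ranking criterion, which is one possible choice of feature selection method rather than the one necessarily used here. The helper names `select_top` and `grouped_selection` are hypothetical, and reshuffling the features before each round is an added assumption so that groups differ between rounds.

```r
library(randomForest)

# Hypothetical helper: fit a small forest on one group of features and
# return the names of the `keep` features with the highest importance.
select_top <- function(x, y, keep, ntree = 100) {
  rf <- randomForest(x = x, y = y, ntree = ntree, importance = TRUE)
  imp <- importance(rf, type = 1)  # type = 1: permutation importance
  ranked <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
  ranked[seq_len(min(keep, length(ranked)))]
}

# Hypothetical helper: split the features into groups, keep the best of
# each group, pool the survivors, and repeat until at most `keep` remain.
grouped_selection <- function(x, y, group_size = 1000, keep = 300,
                              ntree = 100) {
  stopifnot(keep < group_size)  # otherwise a round would not shrink the set
  features <- colnames(x)
  while (length(features) > keep) {
    features <- sample(features)  # assumption: reshuffle each round
    groups <- split(features,
                    ceiling(seq_along(features) / group_size))
    features <- unlist(lapply(groups, function(g) {
      select_top(x[, g, drop = FALSE], y, keep, ntree)
    }), use.names = FALSE)
  }
  features
}

# Small synthetic demo (made-up data): 40 features, groups of 10, keep 3.
set.seed(1)
x <- matrix(rnorm(200 * 40), nrow = 200,
            dimnames = list(NULL, paste0("f", 1:40)))
y <- factor(x[, 1] + x[, 2] > 0)
selected <- grouped_selection(x, y, group_size = 10, keep = 3, ntree = 50)
print(selected)
```

With the numbers from the steps above (`group_size = 1000`, `keep = 300`), 9,000 features shrink to 2,700, then 900, then 300. Each group can be processed independently, so the `lapply` call is the natural place to spread the work across cores or machines.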

As far as I know, brute force is the only way to find the best feature subset with no chance of losing an important feature.

Upvotes: 2
