Parvathy Sarat

Reputation: 393

Tuning Random Forest classifier

I'm working on a Random Forest classifier model to classify data into 4 categories. The datasets are concatenated logs from two different firewall apps, with 100K+ samples and 10 attributes. Accuracy on samples drawn from the training set is 0.90-0.97.

With the test data, I'm getting 0.7-0.97 for logs from one of the firewalls, and consistently ~0.5 for the other. Should I be looking into tuning the RF, or is something wrong with my model entirely? "Text field1" is the field to be predicted.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Encode the target as integer codes; `label` keeps the original category values
y, label = pd.factorize(train["Text field1"])

clf = RandomForestClassifier(n_jobs=10, n_estimators=100, max_features=1, max_depth=80)
clf.fit(train[features], y)

# Predict codes on the test set, then map back to the original labels
pred = clf.predict(test[features])
pred_label = label[pred]

I'm a beginner at this; any help is appreciated. Thanks!

Upvotes: 0

Views: 1547

Answers (2)

Jeremy McGibbon

Reputation: 3785

There are two problems I see here. One is that, like Rachel said, you've definitely been over-fitting your data. 80 is a really deep tree! That would give each tree up to 2^80 possible leaves, roughly a 1 followed by 24 zeros. Since you only have 100k+ samples, you're definitely giving a perfect fit on each tree to its respective bootstrap of the training data. Once you have enough depth to do this, further increases in the depth limit don't do anything, and you're significantly past that point. This is undesirable.

Since even a balanced tree of depth 17 already has 2^17 ≈ 130k leaf nodes, you should look at depths shallower than 17. Once you have a reasonable depth, max_features=1 will probably no longer be optimal, so you should also test some different values for that (a quick single-parameter sweep, sketched below, is a cheap way to start).
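
As a rough illustration (not part of the original answer), here's one way to sweep max_depth alone with cross-validation, reusing the train, features, and y variables from the question; the depth values are just starting points:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Compare mean cross-validated accuracy across a few candidate depths
for depth in [4, 8, 12, 16]:
    clf = RandomForestClassifier(n_estimators=100, max_depth=depth, n_jobs=10)
    scores = cross_val_score(clf, train[features], y, cv=3)
    print(depth, scores.mean())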

The other issue you've brought up is that you have different performance on the two firewalls. There are a couple possible reasons for this:

  • If you're training on one firewall and testing on the other, you should expect the model to do well only on the parts of the datasets that are similar. It's best to do a train/test split within (independent) data from the same firewall, or, since you're using a random forest, just look at the out-of-bag performance (stored in clf.oob_score_ when the forest is fit with oob_score=True; see the sketch after this list).
  • If you're already doing this, the non-optimal parameters you've been using might be having different impacts on the two datasets. For example, if all of the data for firewall 1 is kind of similar, over-fitting won't degrade performance on the testing data you've chosen, while if firewall 2 has many outlier regimes, over-fitting will greatly degrade performance. If you're in this scenario, fixing the first problem should help with this second one.
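
For example, a minimal sketch of checking the out-of-bag score on data from a single firewall, where fw1_X and fw1_y are hypothetical arrays holding only firewall 1's features and labels:

from sklearn.ensemble import RandomForestClassifier

# oob_score=True is required for clf.oob_score_ to be populated after fitting
clf = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=10)
clf.fit(fw1_X, fw1_y)
print(clf.oob_score_)  # accuracy estimated on the out-of-bag samples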

If training your model is fast, you may find that GridSearchCV helps with your parameter selection (that's what it's designed for): it automatically tests different combinations of parameters. Keep in mind that if you test N depths and M values of max_features, you get N*M combinations, so it's good to sample these sparsely at first (maybe depths of 18, 12, 8 and max_features of 2, 5, 8 to begin with), and then run it again with values closer to the optimal set you found the first time.
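
A sketch of what that might look like, assuming the train, features, and y variables from the question (the grid values are just the coarse starting points mentioned above, not tuned results):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# 3 depths x 3 max_features values = 9 combinations, each cross-validated
param_grid = {"max_depth": [8, 12, 18], "max_features": [2, 5, 8]}
search = GridSearchCV(RandomForestClassifier(n_estimators=100), param_grid, cv=3, n_jobs=10)
search.fit(train[features], y)
print(search.best_params_, search.best_score_)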

Upvotes: 1

Rachel Kogan
Rachel Kogan

Reputation: 86

Your model is overfitting to the training set. You should reduce the variance / increase the bias of your model, either by (a sketch follows the list):

  1. increasing randomness: consider a smaller fraction of the features at each split
  2. limiting the number of splits each tree can make
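
In sklearn terms, a minimal sketch of both ideas, reusing train, features, and y from the question; the specific values here are illustrative, not tuned:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # 1. consider only a subset of features at each split
    max_depth=10,         # 2. limit how many splits each tree can make
)
clf.fit(train[features], y)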

Upvotes: 1
