Reputation: 393
I'm working on a Random Forest classifier model to classify data into 4 categories. The datasets are concatenated logs from two different firewall apps with 100K+ samples and 10 attributes. The accuracy with sample datasets from the training set is 0.90-0.97.
With the test data, I'm getting 0.7-0.97 on logs from one of the firewalls, and consistently ~0.5 on the other. Should I be looking into tuning the RF, or is something entirely wrong with my model? "Text field1" is the field to be predicted.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# encode the target labels as integers
y, label = pd.factorize(train["Text field1"])
clf = RandomForestClassifier(n_jobs=10, n_estimators=100, max_features=1, max_depth=80)
clf.fit(train[features], y)
pred = clf.predict(test[features])
pred_label = label[pred]  # map integer predictions back to the original labels
I'm a beginner at this, any help is appreciated. Thanks!
Upvotes: 0
Views: 1547
Reputation: 3785
There are two problems I see here. One is, like Rachel said, you've definitely been over-fitting your data. 80 is a really deep tree! That would allow each tree up to 2^80 possible leaves, roughly 1 followed by 24 zeros! Since you only have 100K+ samples, each tree is definitely fitting its respective bootstrap of the training data perfectly. Once you have enough depth to do this, further increases in the depth limit don't do anything, and you're significantly past that point. This is undesirable.
Since a balanced tree of depth 17 already has 2^17 ≈ 131K leaf nodes (more than your sample count), you should look at depths shallower than 17. Once you have a reasonable depth, max_features=1 will probably no longer be optimal. You should also test some different values for that.
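A sketch of that depth sweep on synthetic data (not the asker's firewall logs; the dataset shape here is made up): past a certain depth, training accuracy saturates while the out-of-bag estimate stops improving.

```python
# Sketch with synthetic data: sweep tree depth and compare train accuracy
# with the forest's out-of-bag (OOB) estimate of generalization.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

for depth in (4, 8, 12, 80):
    clf = RandomForestClassifier(n_estimators=100, max_depth=depth,
                                 oob_score=True, random_state=0, n_jobs=-1)
    clf.fit(X, y)
    # as depth grows, train score approaches 1.0 while oob_score_ plateaus
    print(depth, round(clf.score(X, y), 3), round(clf.oob_score_, 3))
```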
The other issue you've brought up is that you have different performance on the two firewalls. A couple of possible reasons for this: the two firewalls' logs may have quite different feature distributions, or one may simply be harder to classify. Comparing your test accuracy against the forest's out-of-bag estimate (clf.oob_score_, available when you fit with oob_score=True) will tell you how much of the gap is over-fitting.

If training your model is fast, you may find that GridSearchCV will help with your parameter selection (that's what it's designed for). It will automatically test different combinations of parameters. Keep in mind that if you test N depths and M values of max_features, you get N*M combinations, so it's good to sample these sparsely at first (maybe depths of 18, 12, 8 and max_features of 2, 5, 8 to begin with), and then run it again with values closer to the optimal set you found the first time.
Upvotes: 1
Reputation: 86
Your model is overfitting to the training set. You should reduce the variance / increase the bias of your model, for example by lowering max_depth, raising min_samples_leaf or min_samples_split, or restricting max_features.
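A minimal sketch of such a regularized forest on synthetic data (the parameter values here are illustrative starting points, not tuned for the asker's logs):

```python
# Sketch: rein in variance by limiting depth and requiring more samples
# per leaf, then check generalization with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

regularized = RandomForestClassifier(n_estimators=100, max_depth=10,
                                     min_samples_leaf=5, random_state=0)
scores = cross_val_score(regularized, X, y, cv=3)
print(round(scores.mean(), 3))  # cross-validated accuracy
```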
Upvotes: 1