Ensuring the right order of operations in random forest classification in scikit-learn

Question

I would like to ensure that the order of operations for my machine learning is right:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold

# 1. Initialize model
model = RandomForestClassifier(5000)

# 2. Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# 3. Remove unimportant features
model = SelectFromModel(model, threshold=0.5).estimator

# 4. cross validate model on the important features
k_fold = KFold(n_splits=10, shuffle=True)
for k, (train, test) in enumerate(k_fold.split(X)):
    model.fit(X[train], y[train])

# 5. grid search for best parameters
param_grid = {
    'n_estimators': [1000, 2500, 5000],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [3, 5, X.shape[1]]
}

gs = GridSearchCV(estimator=model, param_grid=param_grid)
gs.fit(X, y)
model = gs.best_estimator_

# Now the model can be used for prediction

Please let me know if this order looks good or if something can be done to improve it.

--EDIT, clarifying to reduce downvotes.

Specifically:

1. Should SelectFromModel be applied after cross-validation?

2. Should grid search be done before cross-validation?

David Maust · Accepted Answer

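The main problem with your approach is that you are confusing the feature selection transformer with the final estimator. In step 3, SelectFromModel(model, threshold=0.5).estimator simply hands back the unfitted random forest you passed in, so no features are actually removed. What you need instead is two stages, starting with the transformer: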

rf_feature_imp = RandomForestClassifier(100)
feat_selection = SelectFromModel(rf_feature_imp, threshold=0.5)
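
To see what the transformer stage does on its own, here is a minimal sketch on the iris data from the question. Random forest importances are normalized to sum to 1, so with threshold=0.5 at most two features can pass, and on a four-feature dataset it is possible that none do. The sketch therefore assumes threshold="mean" and a fixed random_state; both are illustrative choices, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Assumption: threshold="mean" keeps features whose importance is at least
# the average importance, which guarantees at least one feature survives.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="mean",
)
X_reduced = selector.fit_transform(X, y)

# Boolean mask over the original columns: True means the feature was kept.
mask = selector.get_support()
print(mask)
print(X.shape, "->", X_reduced.shape)
```

The mask makes it easy to check which columns reach the classifier before committing to a threshold.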

Then you need a second stage that trains a classifier on the reduced feature set:

clf = RandomForestClassifier(5000)

Once you have both stages, you can build a pipeline that combines them into a final model.

from sklearn.pipeline import Pipeline

model = Pipeline([
    ('fs', feat_selection),
    ('clf', clf),
])
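
As a rough end-to-end check, the two-stage pipeline can be fit like any single estimator. The sketch below assumes a held-out split made with train_test_split, threshold="mean", and smaller forests than in the answer so that it runs quickly; all of these values are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = Pipeline([
    # Stage 1: keep only the features the first forest ranks as important.
    ("fs", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold="mean",  # assumption, see the note above
    )),
    # Stage 2: the final classifier, trained on the reduced feature set.
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

# fit() runs the selection on the training data only, then trains the classifier.
model.fit(X_train, y_train)

n_kept = int(model.named_steps["fs"].get_support().sum())
accuracy = model.score(X_test, y_test)
print("features kept:", n_kept)
print("held-out accuracy:", accuracy)
```

Calling predict or score on the pipeline applies the same feature mask to new data automatically, so the test set never influences which features are selected.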

Now you can perform a grid search on the whole pipeline. Because it has two stages, each parameter name must be prefixed with its stage name, fs or clf, followed by a double underscore. For the feature selection stage, you can also reach the base estimator through fs__estimator. Below is an example of how to search parameters on any of the three objects.

params = {
    'fs__threshold': [0.5, 0.3, 0.7],
    'fs__estimator__max_features': ['sqrt', 'log2'],
    'clf__max_features': ['sqrt', 'log2'],
}

gs = GridSearchCV(model, params, ...)
gs.fit(X, y)
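
For reference, a complete version of the search might look like the following. The elided arguments are filled in with cv=5 purely as an assumption, the forests are kept small for speed, and the numeric thresholds are swapped for the string values "mean" and "median" so that every candidate keeps at least one iris feature.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

model = Pipeline([
    ("fs", SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Keys follow the <step>__<parameter> convention; fs__estimator__... reaches
# the forest wrapped inside SelectFromModel.
params = {
    "fs__threshold": ["mean", "median"],
    "fs__estimator__max_features": ["sqrt", "log2"],
    "clf__max_features": ["sqrt", None],
}

# Assumption: 5-fold cross-validation with the default accuracy scoring.
gs = GridSearchCV(model, params, cv=5)
gs.fit(X, y)

print(gs.best_params_)
print(gs.best_score_)
```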

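You can then make predictions with gs directly or with gs.best_estimator_. GridSearchCV runs the cross-validation itself, and because feature selection is a pipeline stage it is refit on the training portion of every fold, so no separate manual cross-validation loop is needed.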
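
If you also want an estimate of how the tuned pipeline generalizes, one common option is to wrap the whole search in an outer cross-validation loop (nested cross-validation). The sketch below is one way to do that under assumed settings: a deliberately tiny grid, 3 inner folds, 5 outer folds, and small forests. It also shows that predicting through gs and through gs.best_estimator_ gives the same result.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

model = Pipeline([
    ("fs", SelectFromModel(RandomForestClassifier(n_estimators=30, random_state=0))),
    ("clf", RandomForestClassifier(n_estimators=30, random_state=0)),
])

# Inner loop: the grid search picks a threshold using 3-fold CV.
gs = GridSearchCV(model, {"fs__threshold": ["mean", "median"]}, cv=3)

# Outer loop: each of the 5 folds reruns the full search on its training part
# and scores the winning pipeline on data the search never saw.
outer_scores = cross_val_score(gs, X, y, cv=5)
print("outer-fold accuracy:", outer_scores.mean())

# For the final model, fit the search once on all available data.
gs.fit(X, y)
pred_search = gs.predict(X[:5])                # delegates to the refit best pipeline
pred_best = gs.best_estimator_.predict(X[:5])  # same pipeline, called directly
print(pred_search, pred_best)
```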
