Reputation: 21961
I would like to ensure that the order of operations for my machine learning is right:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.grid_search import GridSearchCV
# 1. Initialize model
model = RandomForestClassifier(5000)
# 2. Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
# 3. Remove unimportant features
model = SelectFromModel(model, threshold=0.5).estimator
# 4. cross validate model on the important features
k_fold = KFold(n=len(data), n_folds=10, shuffle=True)
for k, (train, test) in enumerate(k_fold):
self.model.fit(data[train], target[train])
# 5. grid search for best parameters
param_grid = {
'n_estimators': [1000, 2500, 5000],
'max_features': ['auto', 'sqrt', 'log2'],
'max_depth': [3, 5, data.shape[1]]
}
gs = GridSearchCV(estimator=model, param_grid=param_grid)
gs.fit(X, y)
model = gs.best_estimator_
# Now the model can be used for prediction
Please let me know if this order looks good or if something can be done to improve it.
--EDIT, clarifying to reduce downvotes.
Specifically,
1. Should the SelectFromModel
be done after cross validation?
Upvotes: 0
Views: 1090
Reputation: 8270
The main problem with your approach is you are confusing the feature selection transformer with the final estimator. What you will need to do is create two stages, the transformer first:
rf_feature_imp = RandomForestClassifier(100)
feat_selection = SelectFromModel(rf_feature_imp, threshold=0.5)
Then you need a second phase where you use the reduced feature set to train a classifier on the reduced feature set.
clf = RandomForestClassifier(5000)
Once you have your phases, you can build a pipeline to combine the two into a final model.
model = Pipeline([
('fs', feat_selection),
('clf', clf),
])
Now you can perform a GridSearch
on your model
. Keep in mind you have two stages, so the parameters must be specified by stage fs
or clf
. In terms of the feature selection stage, you can also access the base estimator using fs__estimator
. Below is an example of how to search parameters on any of the three objects.
params = {
'fs__threshold': [0.5, 0.3, 0.7],
'fs__estimator__max_features': ['auto', 'sqrt', 'log2'],
'clf__max_features': ['auto', 'sqrt', 'log2'],
}
gs = GridSearchCV(model, params, ...)
gs.fit(X,y)
You can then make predictions with gs
directly or using gs.best_estimator_
.
Upvotes: 6