Reputation: 2232
I found and successfully tested following script that applies Pipeline and GridSearchCV to classifier selection. The script outputs the best classifier and its accuracy.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn import datasets
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10] # Augmenting test data
y_test = iris.target[:10] # Augmenting test data
#Create a pipeline
pipe = Pipeline([('classifier', LogisticRegression())])
# Create space of candidate learning algorithms and their hyperparameters
search_space = [{'classifier': [LogisticRegression()],
'classifier__penalty': ['l1', 'l2'],
'classifier__C': np.logspace(0, 4, 10)},
{'classifier': [RandomForestClassifier()],
'classifier__n_estimators': [10, 100, 1000],
'classifier__max_features': [1, 2, 3]}]
# Create grid search
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0)
# Fit grid search
best_model = clf.fit(X_train, y_train)
print('Best training accuracy: %.3f' % best_model.best_score_)
print('Best estimator:', best_model.best_estimator_.get_params()['classifier'])
# Predict on test data with best params
y_pred = best_model.predict(X_test)
# Test data accuracy of model with best params
print(classification_report(y_test, y_pred, digits=4))
print('Test set accuracy score for best params: %.3f' % accuracy_score(y_test, y_pred))
from sklearn.metrics import precision_recall_fscore_support
print(precision_recall_fscore_support(y_test, y_pred,
average='weighted'))
How can I adjust the script so that it not only outputs the best classifier, which is LogReg in our example, but also the best selected among the other classifiers? Above, I like to see the output from RandomForestClassifier()
, too.
Ideal is a solution where the best classifier for each algorithm (LogReg, RandomForest,..) is shown and where each of those best classifiers is sorted into a table. The first column or index should be the model and precision_recall_fscore_support
values are in rows on the right. The table should then be sorted by F-score.
PS: Though the script works, I'm yet unsure what the function of LogisticRegression()
in the Pipeline is, as it's defined in the search space later.
Solution (simplified):
from sklearn import datasets
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10]
y_test = iris.target[:10]
seed=1
models = [
'RFC',
'logisticRegression'
]
clfs = [
RandomForestClassifier(random_state=seed,n_jobs=-1),
LogisticRegression()
]
params = {
models[0]:{'n_estimators':[100]},
models[1]: {'C':[1000]}
}
for name, estimator in zip(models,clfs):
print(name)
clf = GridSearchCV(estimator, params[name], scoring='accuracy', refit='True', n_jobs=-1, cv=5)
clf.fit(X_train, y_train)
print("best params: " + str(clf.best_params_))
print("best scores: " + str(clf.best_score_))
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.4%}".format(acc))
print(classification_report(y_test, y_pred, digits=4))
Upvotes: 1
Views: 1173
Reputation: 33127
import pandas as pd
import numpy as np
df = pd.DataFrame(list(best_model.cv_results_['params']))
ranking = best_model.cv_results_['rank_test_score']
# The sorting is done based on the test_score of the models.
sorting = np.argsort(best_model.cv_results_['rank_test_score'])
# Sort the lines based on the ranking of the models
df_final = df.iloc[sorting]
# The first line contains the best model and its parameters
df_final.to_csv('sorted_table.csv')
# OR to avoid the index in the writting
df_final.to_csv('sorted_table2.csv',index=False)
Results:
However, in this case, the ordering is not done based on the F values. To do so use this. Define in the GridSearch
the scoring
attribute to f1_weighted
and repeat my code.
Example:
...
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0,scoring='f1_weighted')
best_model = clf.fit(X_train, y_train)
df = pd.DataFrame(list(best_model.cv_results_['params']))
ranking = best_model.cv_results_['rank_test_score']
# The sorting is done based on the F values of the models.
sorting = np.argsort(best_model.cv_results_['rank_test_score'])
# Sort the lines based on the ranking of the models
df_final = df.iloc[sorting]
df_final.to_csv('F_sorted_table.csv')
Upvotes: 2