Writing best GridSearch classifiers into a table

Question

I found and successfully tested the following script, which uses Pipeline and GridSearchCV to select a classifier. The script outputs the best classifier and its accuracy.

import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10]    # First 10 samples reused as a stand-in test set
y_test = iris.target[:10]  # Labels for those same 10 samples

# Create a pipeline
pipe = Pipeline([('classifier', LogisticRegression())])

# Create space of candidate learning algorithms and their hyperparameters
search_space = [{'classifier': [LogisticRegression()],
                 'classifier__penalty': ['l1', 'l2'],
                 'classifier__C': np.logspace(0, 4, 10)},
                {'classifier': [RandomForestClassifier()],
                 'classifier__n_estimators': [10, 100, 1000],
                 'classifier__max_features': [1, 2, 3]}]

# Create grid search 
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0)

# Fit grid search
best_model = clf.fit(X_train, y_train)

print('Best training accuracy: %.3f' % best_model.best_score_)
print('Best estimator:', best_model.best_estimator_.get_params()['classifier'])
# Predict on test data with best params
y_pred = best_model.predict(X_test)
# Test data accuracy of model with best params
print(classification_report(y_test, y_pred, digits=4))
print('Test set accuracy score for best params: %.3f' % accuracy_score(y_test, y_pred))

from sklearn.metrics import precision_recall_fscore_support
print(precision_recall_fscore_support(y_test, y_pred, average='weighted'))

How can I adjust the script so that it outputs not only the overall best classifier, which is LogReg in this example, but also the best candidate found for each of the other classifiers? In the example above, I would like to see the best result for RandomForestClassifier() as well.

Ideally, the solution would show the best classifier for each algorithm (LogReg, RandomForest, ...) and collect those best classifiers in a table. The first column or index should be the model, with the precision_recall_fscore_support values in the columns to its right. The table should then be sorted by F-score.

PS: Although the script works, I'm still unsure what role LogisticRegression() plays in the Pipeline, since it is defined again in the search space later.
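
On the PS: as far as I can tell, the LogisticRegression() passed to Pipeline only fills the step named 'classifier'. Each dictionary in the search space has a 'classifier' key, so the step is replaced through the pipeline's parameters before each candidate is fitted. A minimal sketch of that mechanism, calling set_params by hand (the n_estimators value is arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# The estimator given here only occupies the 'classifier' slot.
pipe = Pipeline([('classifier', LogisticRegression())])
print(type(pipe.named_steps['classifier']).__name__)

# Setting the 'classifier' parameter swaps out the whole step, which is
# what a 'classifier' key in the search space asks the grid search to do.
pipe.set_params(classifier=RandomForestClassifier(n_estimators=10))
print(type(pipe.named_steps['classifier']).__name__)
```

So any estimator could serve as the initial placeholder; it just has to be there so the step name exists.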

Solution (simplified):

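from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
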
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10]
y_test = iris.target[:10]

seed = 1
models = ['RFC', 'logisticRegression']
clfs = [
    RandomForestClassifier(random_state=seed, n_jobs=-1),
    LogisticRegression(),
]

params = {
    models[0]: {'n_estimators': [100]},
    models[1]: {'C': [1000]},
}


for name, estimator in zip(models,clfs):

    print(name)

    clf = GridSearchCV(estimator, params[name], scoring='accuracy', refit=True, n_jobs=-1, cv=5)

    clf.fit(X_train, y_train)

    print("best params: " + str(clf.best_params_))
    print("best scores: " + str(clf.best_score_))

    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)

    print("Accuracy: {:.4%}".format(acc))
    print(classification_report(y_test, y_pred, digits=4))
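
The loop above only prints the results. To get the table asked for in the question, one option is to collect one row per model and build a pandas DataFrame from the rows. This is a sketch, not a tuned setup: the parameter grids, max_iter=1000 and zero_division=0 are my own choices for a quick, quiet run, and pandas is an extra dependency.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
X_train, y_train = iris.data, iris.target
X_test, y_test = iris.data[:10], iris.target[:10]  # same stand-in test set

# Illustrative grids (assumed values), kept small so the search is fast.
candidates = {
    'RFC': (RandomForestClassifier(random_state=1), {'n_estimators': [10, 100]}),
    'logisticRegression': (LogisticRegression(max_iter=1000), {'C': [1, 1000]}),
}

records = []
for name, (estimator, grid) in candidates.items():
    # One grid search per algorithm, so each one yields its own best model.
    search = GridSearchCV(estimator, grid, scoring='accuracy', cv=5)
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    precision, recall, fscore, _ = precision_recall_fscore_support(
        y_test, y_pred, average='weighted', zero_division=0)
    records.append({
        'model': name,
        'best_params': str(search.best_params_),
        'cv_accuracy': search.best_score_,
        'precision': precision,
        'recall': recall,
        'fscore': fscore,
    })

# Model name as index, metrics as columns, best F-score first.
table = (pd.DataFrame(records)
         .set_index('model')
         .sort_values('fscore', ascending=False))
print(table)
```

Note that the test set here is taken from the training data, so the scores will look better than they would on unseen data.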

Answers (1)

If I understood correctly, this should work fine.
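
If you would rather keep the single Pipeline search from the question, another possibility is to read the best candidate per classifier out of cv_results_. The sketch below assumes pandas is available and uses reduced grids of my own choosing; it ranks by mean cross-validated score rather than by F-score on a test set.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

iris = datasets.load_iris()
X, y = iris.data, iris.target

pipe = Pipeline([('classifier', LogisticRegression())])
# Reduced search space (assumed values) to keep the example short.
search_space = [
    {'classifier': [LogisticRegression(max_iter=1000)],
     'classifier__C': [1, 10, 100]},
    {'classifier': [RandomForestClassifier(random_state=1)],
     'classifier__n_estimators': [10, 100]},
]
search = GridSearchCV(pipe, search_space, cv=5).fit(X, y)

results = pd.DataFrame(search.cv_results_)
# Label each candidate with the class name of its classifier step.
results['model'] = [type(p['classifier']).__name__ for p in results['params']]
# Keep the row with the highest mean CV score for each model.
best_rows = results.loc[results.groupby('model')['mean_test_score'].idxmax()]
best_per_model = (best_rows[['model', 'mean_test_score', 'params']]
                  .sort_values('mean_test_score', ascending=False)
                  .set_index('model'))
print(best_per_model)
```

Only the overall winner is refit by GridSearchCV, so to predict with the runner-up you would need to fit it yourself using the parameters in the params column.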
