Christopher

Reputation: 2232

Writing best GridSearch classifiers into a table

I found and successfully tested the following script, which uses Pipeline and GridSearchCV to select a classifier. The script outputs the best classifier and its accuracy.

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score, classification_report
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10] # Test data: first 10 training rows (demo only, not a held-out set)
y_test = iris.target[:10] # Matching labels for those 10 rows

#Create a pipeline
pipe = Pipeline([('classifier', LogisticRegression())])

# Create space of candidate learning algorithms and their hyperparameters
# (solver='liblinear' supports both the 'l1' and 'l2' penalties)
search_space = [{'classifier': [LogisticRegression(solver='liblinear')],
                 'classifier__penalty': ['l1', 'l2'],
                 'classifier__C': np.logspace(0, 4, 10)},
                {'classifier': [RandomForestClassifier()],
                 'classifier__n_estimators': [10, 100, 1000],
                 'classifier__max_features': [1, 2, 3]}]

# Create grid search 
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0)

# Fit grid search
best_model = clf.fit(X_train, y_train)

# best_score_ is the mean cross-validated score of the best candidate
print('Best training accuracy: %.3f' % best_model.best_score_)
print('Best estimator:', best_model.best_estimator_.get_params()['classifier'])
# Predict on test data with best params
y_pred = best_model.predict(X_test)
# Test data accuracy of model with best params
print(classification_report(y_test, y_pred, digits=4))
print('Test set accuracy score for best params: %.3f' % accuracy_score(y_test, y_pred))

from sklearn.metrics import precision_recall_fscore_support
print(precision_recall_fscore_support(y_test, y_pred,
                                      average='weighted'))

How can I adjust the script so that it outputs not only the overall best classifier (LogReg in this example) but also the best candidate for each of the other algorithms? In the example above, I would also like to see the best result for RandomForestClassifier().

Ideally, the solution would show the best classifier for each algorithm (LogReg, RandomForest, ...) and collect them in a table. The first column (or the index) should be the model, with the precision_recall_fscore_support values in the same row to its right. The table should then be sorted by F-score.

PS: Although the script works, I am still unsure what purpose LogisticRegression() serves in the Pipeline, since the classifier is defined again in the search space later.

Solution (simplified):

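from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
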
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10]
y_test = iris.target[:10]

seed = 1
models = ['RFC', 'logisticRegression']
clfs = [
    RandomForestClassifier(random_state=seed, n_jobs=-1),
    LogisticRegression(),
]

# One parameter grid per model name
params = {
    models[0]: {'n_estimators': [100]},
    models[1]: {'C': [1000]},
}


for name, estimator in zip(models,clfs):

    print(name)

    clf = GridSearchCV(estimator, params[name], scoring='accuracy', refit=True, n_jobs=-1, cv=5)

    clf.fit(X_train, y_train)

    print("best params: " + str(clf.best_params_))
    print("best scores: " + str(clf.best_score_))

    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)

    print("Accuracy: {:.4%}".format(acc))
    print(classification_report(y_test, y_pred, digits=4))

Upvotes: 1

Views: 1173

Answers (1)

seralouk

Reputation: 33127

If I understood correctly, this should work fine.

import pandas as pd
import numpy as np

# One row per candidate; the columns are the parameters that were searched
df = pd.DataFrame(list(best_model.cv_results_['params']))
# rank_test_score is 1 for the best candidate, so argsort puts the best first.
# The ranking is based on the test_score of the models.
sorting = np.argsort(best_model.cv_results_['rank_test_score'])

# Sort the rows by the ranking of the models
df_final = df.iloc[sorting]

# The first row contains the best model and its parameters
df_final.to_csv('sorted_table.csv')

# Or, to omit the index when writing
df_final.to_csv('sorted_table2.csv', index=False)
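
To see the best candidate for each algorithm rather than only the full ranking, one option is to label every row of cv_results_ with the class name of its classifier and keep the top-ranked row per label. This is only a minimal sketch, assuming the same single-step pipeline as in the question; the reduced grid, max_iter=1000, random_state=0, and the names results and best_per_model are my own choices for illustration.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

iris = datasets.load_iris()
X_train, y_train = iris.data, iris.target

pipe = Pipeline([('classifier', LogisticRegression())])

# Small grid (an assumption, chosen only to keep the example fast)
search_space = [
    {'classifier': [LogisticRegression(max_iter=1000)],
     'classifier__C': [1.0, 10.0]},
    {'classifier': [RandomForestClassifier(random_state=0)],
     'classifier__n_estimators': [10, 50]},
]

best_model = GridSearchCV(pipe, search_space, cv=5).fit(X_train, y_train)

results = pd.DataFrame(best_model.cv_results_)

# Label each candidate with the class name of its classifier
results['model'] = [type(p['classifier']).__name__
                    for p in best_model.cv_results_['params']]

# Sort by rank (1 = best) and keep the first row of each algorithm
best_per_model = (
    results.sort_values('rank_test_score')
           .groupby('model', sort=False)
           .head(1)
           .loc[:, ['model', 'mean_test_score', 'std_test_score', 'params']]
           .reset_index(drop=True)
)
print(best_per_model)
```

best_per_model can be written out with to_csv in the same way as df_final above.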



However, in this case the ordering is based on the default score (accuracy), not on the F values. To rank by F-score instead, set the scoring parameter of GridSearchCV to 'f1_weighted' and rerun my code.


Example:

...
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, scoring='f1_weighted')

best_model = clf.fit(X_train, y_train)

df = pd.DataFrame(list(best_model.cv_results_['params']))

# rank_test_score now reflects the weighted F1 score, so the sorting is based on the F values of the models.
sorting = np.argsort(best_model.cv_results_['rank_test_score'])

# Sort the rows by the ranking of the models
df_final = df.iloc[sorting]

df_final.to_csv('F_sorted_table.csv')
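
For the table the question asks for (model as index, precision, recall and F-score in the same row, sorted by F-score), a possible sketch is to run one grid search per algorithm, as in the simplified solution in the question, and collect precision_recall_fscore_support for each refitted best estimator. The train_test_split hold-out, the small grids, and the names candidates, rows and table are assumptions for illustration, not part of the original script.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()

# Hold out a real test split (an assumption; the question reuses training rows)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target)

# Hypothetical candidates with small grids, for illustration only
candidates = {
    'LogisticRegression': (LogisticRegression(max_iter=1000),
                           {'C': [1.0, 10.0, 100.0]}),
    'RandomForestClassifier': (RandomForestClassifier(random_state=0),
                               {'n_estimators': [10, 50]}),
}

rows = []
for name, (estimator, grid) in candidates.items():
    # One search per algorithm, ranked by weighted F1 during cross-validation
    search = GridSearchCV(estimator, grid, scoring='f1_weighted', cv=5)
    search.fit(X_train, y_train)

    # Score the refitted best estimator of this algorithm on the test split
    precision, recall, fscore, _ = precision_recall_fscore_support(
        y_test, search.predict(X_test), average='weighted')
    rows.append({'model': name, 'precision': precision, 'recall': recall,
                 'fscore': fscore, 'best_params': search.best_params_})

# Model name as index, one row per algorithm, best F-score first
table = (pd.DataFrame(rows)
           .set_index('model')
           .sort_values('fscore', ascending=False))
print(table)
```

Calling table.to_csv(...) would write the result to disk in the same way as the tables above.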



Upvotes: 2
