How to use GridSearchCV with MultiOutputClassifier(MLPClassifier) Pipeline

Question

I am trying out scikit-learn for the first time, for a Multi-Output Multi-Class text classification problem. I am attempting to use GridSearchCV to optimize the parameters of MLPClassifier for this purpose.

I will admit that I am shooting in the dark here, having no prior experience. Please let me know if this makes sense.

Below is what I currently have:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

df = pd.read_csv('data.csv')

df.fillna('', inplace=True) #Replaces NaNs with "" in the DataFrame (which would be considered a viable choice in this multi-classification model)

x_features = df['input_text']
y_labels = df[['output_text_label_1', 'output_text_label_2']]

x_train, x_test, y_train, y_test = train_test_split(x_features, y_labels, test_size=0.3, random_state=7)

pipe = Pipeline(steps=[('cv', CountVectorizer()),
                       ('mlpc', MultiOutputClassifier(MLPClassifier()))])

pipe.fit(x_train, y_train)

pipe.score(x_test, y_test)

pipe.score gives a score of ~0.837, which suggests that the code above is working. Running pipe.predict() on a few test strings also yields reasonable output.

However, even after looking at plenty of examples, I don't understand how to implement GridSearchCV for this Pipeline. Additionally, I would like advice on which parameters to search.

I doubt it makes sense to post my attempts with GridSearchCV since they have been varied and all unsuccessful. But a brief example from a Stack Overflow answer could be:

grid = [
        {
        'activation' : ['identity', 'logistic', 'tanh', 'relu'],
        'solver' : ['lbfgs', 'sgd', 'adam'],
        'hidden_layer_sizes': [(100,),(200,)]
        }
       ]

grid_search = GridSearchCV(pipe, grid, scoring='accuracy', n_jobs=-1)

grid_search.fit(x_train, y_train)

This gives the error:

ValueError: Invalid parameter activation for estimator Pipeline(steps=[('cv', CountVectorizer()), ('mlpc', MultiOutputClassifier(estimator=MLPClassifier()))]). Check the list of available parameters with estimator.get_params().keys().

I'm not sure what causes this, nor exactly how to utilize estimator.get_params().keys() to figure out which parameters are faulty.

Perhaps my uses of ('cv', CountVectorizer()) or ('mlpc', MultiOutputClassifier(MLPClassifier())) are incorrect in relation to the grid parameters.

I believe I need to use CountVectorizer() here because my inputs (and desired label outputs) are all strings.

I would very much appreciate an example of how GridSearchCV should be used with a Pipeline that combines CountVectorizer() and MLPClassifier correctly, and advice on which grid parameters are worth searching.

Sanjar Adilov · Accepted Answer

TL;DR Try something like this:

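from sklearn.preprocessing import StandardScaler

mlpc = MLPClassifier(solver='adam',
                     learning_rate_init=0.01,
                     max_iter=300,
                     activation='relu',
                     early_stopping=True)
# CountVectorizer returns a sparse matrix, which StandardScaler can only
# scale without centering, hence with_mean=False.
pipe = Pipeline(steps=[('cv', CountVectorizer(ngram_range=(1, 1))),
                       ('scale', StandardScaler(with_mean=False)),
                       ('mlpc', MultiOutputClassifier(mlpc))])
# Keys follow <step name>__<parameter>; MLPClassifier sits one level deeper,
# under the wrapper's `estimator` parameter.
search_space = {
    'cv__max_df': (0.9, 0.95, 0.99),
    'cv__min_df': (0.01, 0.05, 0.1),
    'mlpc__estimator__alpha': 10.0 ** -np.arange(1, 5),
    'mlpc__estimator__hidden_layer_sizes': ((64, 32), (128, 64),
                                            (64, 32, 16), (128, 64, 32)),
    'mlpc__estimator__tol': (1e-3, 5e-3, 1e-4),
}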
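
A search space like the one above still has to be passed to GridSearchCV together with the pipeline. The following is only a minimal sketch of that wiring, not a tuned setup: the toy texts and labels are invented for illustration, the grid is deliberately much smaller than the one above so it runs in seconds, and early stopping is left off because the toy data set is tiny. scoring is left at its default so that the pipeline's own score method is used; as far as I know, the 'accuracy' scorer from the question does not accept multiclass-multioutput targets.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 24 short texts, each with two string labels.
texts = ["red apple", "green apple", "red car", "green car"] * 6
labels = np.array([["red", "fruit"], ["green", "fruit"],
                   ["red", "vehicle"], ["green", "vehicle"]] * 6)

pipe = Pipeline(steps=[
    ("cv", CountVectorizer()),
    # with_mean=False because CountVectorizer produces a sparse matrix.
    ("scale", StandardScaler(with_mean=False)),
    ("mlpc", MultiOutputClassifier(MLPClassifier(max_iter=500, random_state=0))),
])

# Reduced, illustrative grid; every key is <step>__<param> or
# <step>__estimator__<param> for the wrapped MLPClassifier.
search_space = {
    "cv__min_df": (1, 2),
    "mlpc__estimator__alpha": (1e-4, 1e-2),
    "mlpc__estimator__hidden_layer_sizes": ((16,), (32, 16)),
}

# No scoring argument: the pipeline's own score method is used.
grid_search = GridSearchCV(pipe, search_space, cv=3)
grid_search.fit(texts, labels)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

After fitting, grid_search.best_estimator_ is the pipeline refitted on all of the training data with the best parameter combination, so it can be used for predict() directly.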

Discussion:

Disclaimer: Most of the remarks are based on my (insubstantial🤔) assumptions about your data and pertain only to scikit-learn's MLPs. Refer to docs to learn more about neural networks and experiment with other tips. And remember, There is No Free Lunch.
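
On the ValueError from the question: the grid keys have to match the names the pipeline itself exposes, and estimator.get_params().keys() lists exactly those names. A small sketch of how one might inspect them, using the pipeline from the question (the variable names bad_grid and good_grid are made up for this illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[("cv", CountVectorizer()),
                       ("mlpc", MultiOutputClassifier(MLPClassifier()))])

# Every name GridSearchCV is allowed to set on this pipeline.
param_names = sorted(pipe.get_params().keys())

# MLPClassifier is nested twice: inside the 'mlpc' step, and inside the
# MultiOutputClassifier's `estimator` parameter.
mlp_params = [name for name in param_names
              if name.startswith("mlpc__estimator__")]
print(mlp_params)

# The bare name from the question is not a pipeline parameter ...
bad_grid = {"activation": ["tanh", "relu"]}
print(set(bad_grid) <= set(param_names))   # False

# ... while the fully prefixed name is.
good_grid = {"mlpc__estimator__activation": ["tanh", "relu"]}
print(set(good_grid) <= set(param_names))  # True
```

The same check works for the vectorizer: its parameters appear under the cv__ prefix, for example cv__ngram_range.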
