Reputation: 5922
I am trying out scikit-learn for the first time, for a Multi-Output Multi-Class text classification problem. I am attempting to use GridSearchCV to optimize the parameters of MLPClassifier for this purpose.
I will admit that I am shooting in the dark here, having no prior experience. Please let me know if this makes sense.
Below is what I currently have:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
df = pd.read_csv('data.csv')
df.fillna('', inplace=True) #Replaces NaNs with "" in the DataFrame (which would be considered a viable choice in this multi-classification model)
x_features = df['input_text']
y_labels = df[['output_text_label_1', 'output_text_label_2']]
x_train, x_test, y_train, y_test = train_test_split(x_features, y_labels, test_size=0.3, random_state=7)
pipe = Pipeline(steps=[('cv', CountVectorizer()),
('mlpc', MultiOutputClassifier(MLPClassifier()))])
pipe.fit(x_train, y_train)
pipe.score(x_test, y_test)
pipe.score gives a score of ~0.837, which seems to suggest that the above code is doing something. Running pipe.predict() on some test strings seems to yield reasonably adequate results.
However, even after looking at plenty of examples, I don't understand how to implement GridSearchCV for this Pipeline. (Additionally, I would like advice on which parameters to search.)
I doubt it makes sense to post my attempts with GridSearchCV, since they have been varied and all unsuccessful. But a brief example from a Stack Overflow answer could be:
grid = [
{
'activation' : ['identity', 'logistic', 'tanh', 'relu'],
'solver' : ['lbfgs', 'sgd', 'adam'],
'hidden_layer_sizes': [(100,),(200,)]
}
]
grid_search = GridSearchCV(pipe, grid, scoring='accuracy', n_jobs=-1)
grid_search.fit(x_train, y_train)
This gives the error:
ValueError: Invalid parameter activation for estimator Pipeline(steps=[('cv', CountVectorizer()), ('mlpc', MultiOutputClassifier(estimator=MLPClassifier()))]). Check the list of available parameters with estimator.get_params().keys().
I'm not sure what causes this, nor exactly how to use estimator.get_params().keys() to figure out which parameters are faulty.
Perhaps my uses of ('cv', CountVectorizer()) or ('mlpc', MultiOutputClassifier(estimator=MLPClassifier())) are incorrect in relation to the grid parameters.
I believe I need to use CountVectorizer() here because my inputs (and desired label outputs) are all strings.
I would very much appreciate an example of how GridSearchCV should be used for a Pipeline, presumably using CountVectorizer() and MLPClassifier in the correct way, and which grid parameters may be advisable to search.
Upvotes: 3
Views: 2566
Reputation: 1099
TL;DR Try something like this:
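from sklearn.preprocessing import StandardScaler
mlpc = MLPClassifier(solver='adam',
                     learning_rate_init=0.01,
                     max_iter=300,
                     activation='relu',
                     early_stopping=True)
# with_mean=False: CountVectorizer returns a sparse matrix, which StandardScaler cannot center
pipe = Pipeline(steps=[('cv', CountVectorizer(ngram_range=(1, 1))),
                       ('scale', StandardScaler(with_mean=False)),
                       ('mlpc', MultiOutputClassifier(mlpc))])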
search_space = {
'cv__max_df': (0.9, 0.95, 0.99),
'cv__min_df': (0.01, 0.05, 0.1),
'mlpc__estimator__alpha': 10.0 ** -np.arange(1, 5),
'mlpc__estimator__hidden_layer_sizes': ((64, 32), (128, 64),
(64, 32, 16), (128, 64, 32)),
'mlpc__estimator__tol': (1e-3, 5e-3, 1e-4),
}
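For reference, here is a minimal sketch of how a pipeline and search space like the ones above might be passed to GridSearchCV. Several things in it are assumptions rather than part of the original setup: the toy texts and labels stand in for data.csv, the grid is cut down so it runs quickly, early_stopping is left out because the toy set is tiny, and StandardScaler is given with_mean=False because CountVectorizer produces a sparse matrix. The scoring argument is also omitted on purpose: accuracy_score does not accept multiclass-multioutput targets, so the search falls back to the pipeline's own score method, which for MultiOutputClassifier is the fraction of rows where every label is predicted correctly.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for data.csv: one text column, two label columns.
x_train = ["red apple", "green apple", "red car", "green car"] * 10
y_train = np.array([text.split() for text in x_train])  # shape (40, 2), string labels

mlpc = MLPClassifier(solver="adam", learning_rate_init=0.01,
                     max_iter=300, random_state=0)
pipe = Pipeline(steps=[
    ("cv", CountVectorizer()),
    ("scale", StandardScaler(with_mean=False)),  # sparse input cannot be centered
    ("mlpc", MultiOutputClassifier(mlpc)),
])

# Keys follow <step name>__<parameter>; parameters of the MLP wrapped by
# MultiOutputClassifier need the extra "estimator__" level.
search_space = {
    "cv__min_df": [1, 2],
    "mlpc__estimator__alpha": [1e-4, 1e-2],
    "mlpc__estimator__hidden_layer_sizes": [(16,), (16, 8)],
}

# No `scoring` argument: the pipeline's own score() is used for each fold.
grid_search = GridSearchCV(pipe, search_space, cv=3)
grid_search.fit(x_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

After fitting, grid_search.best_estimator_ is the pipeline refit on all of the training data with the best parameters, so grid_search.predict(...) and grid_search.score(x_test, y_test) can be used directly.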
Discussion:
- MLPClassifier natively supports multi-label targets (a binary indicator matrix). If your interrelated outputs can be encoded that way, I wouldn't recommend using MultiOutputClassifier, as it trains separate MLPClassifier instances without taking into account the relationship between outputs. Training only one MLPClassifier is faster, cheaper, and usually more accurate.
- The ValueError is due to improper parameter grid names. See Nested parameters. Inside a Pipeline, each parameter name must be prefixed with its step name, and parameters of a wrapped estimator additionally with estimator__, e.g. mlpc__estimator__activation.
- Use solver='adam' to use a cheaper, first-order method as opposed to the quasi-Newton (approximately second-order) 'lbfgs'. Alternatively, try solver='sgd' (even cheaper to compute), but then also tune momentum. I anticipate that your data will be sparse and of different scales after CountVectorizer, and momentum/solver='adam' is a way to tackle varying gradients.
- Add a scaler (StandardScaler will work better) after CountVectorizer, as MLPs are sensitive to feature scaling. Pass with_mean=False, since StandardScaler cannot center the sparse matrix that CountVectorizer returns. Although solver='adam' would probably handle an imbalanced bag of words well, I believe it won't hurt to standardize your data.
- Tuning activation is needless. Set activation='relu'.
- Set early_stopping=True, specify a large enough max_iter, and tune tol to prevent overfitting.
- Tune learning_rate_init with solver='sgd'; for solver='adam', I assume higher learning rates will be OK and adam won't require comprehensive learning-rate tuning.
- Prefer deeper networks to wider ones (e.g. hidden_layer_sizes=(128, 64, 32) to hidden_layer_sizes=(256, 192)).
- Tune alpha.
- The choice of hidden_layer_sizes may depend on the document-term matrix dimension.
- Try different batch_sizes but take into account computational expenses.
- In CountVectorizer, tune max_df and min_df but not ngram_range; I believe at least a two-layer MLP will handle unigram relationships itself in hidden layers without need to process n-grams explicitly.

Disclaimer: Most of the remarks are based on my (insubstantial) assumptions about your data and pertain only to scikit-learn's MLPs. Refer to the docs to learn more about neural networks and experiment with other tips. And remember, There is No Free Lunch.
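On the ValueError specifically: the sketch below shows one way to use get_params().keys() to see which names GridSearchCV will accept. It rebuilds the pipeline from the question (without fitting it) and checks a few names. The variable names valid_names and mlp_names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[("cv", CountVectorizer()),
                       ("mlpc", MultiOutputClassifier(MLPClassifier()))])

# Every key returned here is a legal key for the parameter grid.
valid_names = sorted(pipe.get_params().keys())

# Parameters of the inner MLPClassifier sit two levels deep:
# step "mlpc" -> MultiOutputClassifier's "estimator" -> the parameter itself.
mlp_names = [name for name in valid_names if name.startswith("mlpc__estimator__")]
for name in mlp_names:
    print(name)

print("activation" in valid_names)                   # False: the bare name is rejected
print("mlpc__estimator__activation" in valid_names)  # True: the fully qualified name works
```

Any key in the grid that is not in this list triggers the "Invalid parameter" error, which is why the bare 'activation', 'solver' and 'hidden_layer_sizes' keys from the question fail.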
Upvotes: 3