Reputation: 41
I'm trying to find the best set of hyperparameters for my Logistic Regression estimator with GridSearchCV, and I build the model using a pipeline.
My problem is that when I use the best parameters I get through
grid_search.best_params_
to build the Logistic Regression model, the accuracy is different from the one I get by
grid_search.best_score_
Here is my code
x=tweet["cleaned"]
y=tweet['tag']
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x, y, test_size=.20, random_state=42)
pipeline = Pipeline([
('vectorizer',TfidfVectorizer()),
('chi', SelectKBest()),
('classifier', LogisticRegression())])
grid = {
'vectorizer__ngram_range': [(1, 1), (1, 2),(1, 3)],
'vectorizer__stop_words': [None, 'english'],
'vectorizer__norm': ('l1', 'l2'),
'vectorizer__use_idf':(True, False),
'vectorizer__analyzer':('word', 'char', 'char_wb'),
'classifier__penalty': ['l1', 'l2'],
'classifier__C': [1.0, 0.8],
'classifier__class_weight': [None, 'balanced'],
'classifier__n_jobs': [-1],
'classifier__fit_intercept':(True, False),
}
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring='accuracy', n_jobs=-1, cv=10)
grid_search.fit(X_train,Y_train)
When I print the best score and parameters using
print(grid_search.best_score_)
print(grid_search.best_params_)
the result is
0.7165160230073953
{'classifier__C': 1.0, 'classifier__class_weight': None, 'classifier__fit_intercept': True, 'classifier__n_jobs': -1, 'classifier__penalty': 'l1', 'vectorizer__analyzer': 'word', 'vectorizer__ngram_range': (1, 1), 'vectorizer__norm': 'l2', 'vectorizer__stop_words': None, 'vectorizer__use_idf': False}
Now if I use these parameters to build my model
pipeline = Pipeline([
('vectorizer',TfidfVectorizer(ngram_range=(1, 1),stop_words=None,norm='l2',use_idf= False,analyzer='word')),
('chi', SelectKBest(chi2,k=1000)),
('classifier', LogisticRegression(C=1.0,class_weight=None,fit_intercept=True,n_jobs=-1,penalty='l1'))])
model=pipeline.fit(X_train,Y_train)
print(accuracy_score(Y_test, model.predict(X_test)))
the result drops to 0.68.
Also, rebuilding the pipeline by hand is tedious, so how can I pass the best parameters to the model directly? I could not figure out how to do it as in this (answer), since my setup is slightly different from theirs.
Upvotes: 3
Views: 2696
Reputation: 4243
I put both Logistic Regression and MLPClassifier in a pipeline and switched between the two classifiers. I used GridSearchCV to find the best parameters for each classifier, adjusted the parameters, and then selected the most accurate classifier for the data. Originally the MLPClassifier was more accurate, but after adjusting the C value for the logistic regression, the logistic regression became more accurate.
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4,random_state=42)
pipeline= Pipeline([
('scaler',StandardScaler()),
#('pca', PCA()),
('clf',LogisticRegression(C=5,max_iter=10000, tol=0.1)),
#('clf',MLPClassifier(hidden_layer_sizes=(25,150,25), max_iter=800, solver='lbfgs', activation='relu', alpha=0.7,
# learning_rate_init=0.001, verbose=False, momentum=0.9, random_state=42))
])
pipeline.fit(X_train,y_train)
parameter_grid={'C':np.linspace(5,100,5)}
grid_rf_class=GridSearchCV(
estimator=pipeline['clf'],
param_grid=parameter_grid,
scoring='roc_auc',
n_jobs=2,
cv=5,
refit=True,
return_train_score=True)
grid_rf_class.fit(X_train,y_train)
predictions=grid_rf_class.predict(X_test)
print(accuracy_score(y_test,predictions))
print(grid_rf_class.best_params_)
print(grid_rf_class.best_score_)
Upvotes: 0
Reputation: 4211
Your score is lower in the second case because you are evaluating your pipeline model on the test set, whereas you are evaluating your grid search model using cross-validation (in your case, a 10-fold stratified cross-validation). The cross-validation score is the average over 10 models, each fitted on 9/10 of your training data and evaluated on the remaining 1/10 of that training data. Hence, you cannot expect the same score from both evaluations.
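To make the distinction concrete, here is a minimal sketch that recomputes best_score_ as the mean of the per-fold scores and compares it with a held-out test score. The synthetic data from make_classification and the small grid over C are assumptions for illustration only, not the tweet data from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.8, 1.0]},
    scoring="accuracy",
    cv=10,
)
search.fit(X_train, y_train)

# Collect the 10 validation-fold scores of the winning parameter combination.
best = search.best_index_
fold_scores = np.array(
    [search.cv_results_[f"split{i}_test_score"][best] for i in range(10)]
)
cv_mean = fold_scores.mean()

# Accuracy of the refit best model on data the search never saw.
test_accuracy = search.score(X_test, y_test)

print("mean of fold scores:", cv_mean)
print("best_score_:        ", search.best_score_)
print("test accuracy:      ", test_accuracy)
```

The first two printed numbers should agree, while the third is a separate estimate computed on different data and will usually differ somewhat.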
As for your second question, why can't you just use grid_search.best_estimator_? This takes the best model from your grid search, already refit on the whole training set, so you can evaluate it without rebuilding it from scratch. For instance:
best_model = grid_search.best_estimator_
best_model.score(X_test, Y_test)
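If you do want a fresh pipeline rather than the already fitted best_estimator_, one option is to copy the original pipeline and pass best_params_ to set_params. The sketch below assumes a tiny made-up corpus in place of tweet["cleaned"] and tweet["tag"], a reduced grid, and the liblinear solver. It keeps the same SelectKBest(chi2, ...) step in the searched and the rebuilt pipeline; the question's rebuilt pipeline uses SelectKBest(chi2, k=1000) while the searched one uses the SelectKBest() defaults (f_classif, k=10), so those two models may differ in more than the evaluation protocol.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for the tweet data (assumption).
texts = [
    "good great happy", "love this great day", "happy good love", "great happy day",
    "bad awful sad", "hate this awful day", "sad bad hate", "awful sad day",
] * 3
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("chi", SelectKBest(chi2, k="all")),
    ("classifier", LogisticRegression(solver="liblinear")),
])
grid = {
    "vectorizer__use_idf": [True, False],
    "classifier__C": [0.8, 1.0],
}
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring="accuracy", cv=3)
grid_search.fit(texts, labels)

# Option 1: reuse the pipeline that GridSearchCV already refit (refit=True by default).
best_model = grid_search.best_estimator_

# Option 2: copy the unfitted pipeline and inject the winning parameters.
# The keys in best_params_ use the same step__parameter names that set_params expects.
rebuilt = clone(pipeline).set_params(**grid_search.best_params_)
rebuilt.fit(texts, labels)

print(grid_search.best_params_)
```

Both options end up with the same hyperparameters; the second is only needed when you want to refit on different data.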
Upvotes: 7