Reputation: 51
I've been experimenting with sklearn's grid search and pipeline functionality and have noticed that the f1_score it returns does not match the f1_score I get with hard-coded parameters. I'm looking for help understanding why this may be.
Data background: a two-column .csv file
customer comment (string), category tag (string)
I'm using the out-of-the-box sklearn bag-of-words approach with no pre-processing of the text, just the CountVectorizer.
Hard-coded model...
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

#get .csv data into dataFrame
data_file = 'comment_data_basic.csv'
data = pd.read_csv(data_file,header=0,quoting=3)
#remove data without 'web issue' or 'product related' tag
data = data.drop(data[(data.tag != 'WEB ISSUES') & (data.tag != 'PRODUCT RELATED')].index)
#split dataFrame into two series
comment_data = data['comment']
tag_data = data['tag']
#split data into test and train samples
comment_train, comment_test, tag_train, tag_test = train_test_split(
    comment_data, tag_data, test_size=0.33)
#build count vectorizer
vectorizer = CountVectorizer(min_df=.002,analyzer='word',stop_words='english',strip_accents='unicode')
vectorizer.fit(comment_data)
#vectorize features and convert to array
comment_train_features = vectorizer.transform(comment_train).toarray()
comment_test_features = vectorizer.transform(comment_test).toarray()
#train LinearSVM Model
lin_svm = LinearSVC()
lin_svm = lin_svm.fit(comment_train_features,tag_train)
#make predictions
lin_svm_predicted_tags = lin_svm.predict(comment_test_features)
#score models
lin_svm_score = round(f1_score(tag_test,lin_svm_predicted_tags,average='macro'),3)
lin_svm_accur = round(accuracy_score(tag_test,lin_svm_predicted_tags),3)
lin_svm_prec = round(precision_score(tag_test,lin_svm_predicted_tags,average='macro'),3)
lin_svm_recall = round(recall_score(tag_test,lin_svm_predicted_tags,average='macro'),3)
#write out scores
print('Model f1Score Accuracy Precision Recall')
print('------ ------- -------- --------- ------')
print('LinSVM {f1:.3f} {ac:.3f} {pr:.3f} {re:.3f} '.format(f1=lin_svm_score,ac=lin_svm_accur,pr=lin_svm_prec,re=lin_svm_recall))
The f1_score output is generally around 0.86 (depending on the random seed value).
Now if I reconstruct basically the same workflow with grid search and a pipeline...
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

#get .csv data into dataFrame
data_file = 'comment_data_basic.csv'
data = pd.read_csv(data_file,header=0,quoting=3)
#remove data without 'web issue' or 'product related' tag
data = data.drop(data[(data.tag != 'WEB ISSUES') & (data.tag != 'PRODUCT RELATED')].index)
#build processing pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LinearSVC()),
])
#define parameters to be used in gridsearch
parameters = {
    #'vect__min_df': (.001,.002,.003,.004,.005),
    'vect__analyzer': ('word',),
    'vect__stop_words': ('english', None),
    'vect__strip_accents': ('unicode',),
    #'clf__C': (1,10,100,1000),
}
if __name__ == '__main__':
    grid_search = GridSearchCV(pipeline, parameters, scoring='f1_macro', n_jobs=1)
    grid_search.fit(data['comment'], data['tag'])
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_params = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_params[param_name]))
The returned f1_score is closer to 0.73, with all of the model parameters the same. My understanding is that grid search applies a cross-validation approach internally, so my guess is that the difference comes from whatever splitting approach it uses compared with my use of train_test_split in the original code. However, a drop from 0.86 -> 0.73 feels large to me, and I would like to be confident in my results.
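For what it's worth, my assumption is that the closer apples-to-apples check on my side would be to cross-validate the hard-coded setup with the same scoring, something like the sketch below (using cross_val_score with cv=3 and scoring='f1_macro' is my guess at what GridSearchCV is doing internally):
#sketch: score the hard-coded vectorizer/classifier with the same f1_macro cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

cv_pipeline = Pipeline([
    ('vect', CountVectorizer(min_df=.002, analyzer='word',
                             stop_words='english', strip_accents='unicode')),
    ('clf', LinearSVC()),
])
#3-fold cross-validation on the full (filtered) data, as the grid search sees it
cv_scores = cross_val_score(cv_pipeline, data['comment'], data['tag'],
                            scoring='f1_macro', cv=3)
print(cv_scores, cv_scores.mean())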
Any insight would be greatly appreciated.
Upvotes: 1
Views: 563
Reputation: 16109
In the code you provided you are not setting the random_state parameter of the LinearSVC model, so even with the same hyperparameters you are unlikely to reproduce an exact duplicate of the best estimator from your GridSearchCV. However, that is a minor issue compared with what is really going on.
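To make your runs repeatable either way, here is a rough illustration of pinning the seeds, reusing the names and imports from your own code (the value 42 is arbitrary):
#hard-coded version: fix both the data split and the model seed
comment_train, comment_test, tag_train, tag_test = train_test_split(
    comment_data, tag_data, test_size=0.33, random_state=42)
lin_svm = LinearSVC(random_state=42)

#pipeline version: the same seed on the classifier inside the pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LinearSVC(random_state=42)),
])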
The grid search is itself cross-validated, in your case over 3 folds of the data. The best_score_ you see is the score of the candidate that performed best on average across all of the folds, scored on each fold's held-out data, and it may not be the estimator with the best score on your particular train/test split. It is possible that, for the single split you used, a different estimator would score higher; but if you generated a handful of different splits and scored the candidates on each of their test sets, on average the best_estimator_ would come out on top.
The idea is that by cross-validating you choose an estimator that is more resilient to variation in the data that is not necessarily represented in a single train/test split, so the more splits you use, the better your model should perform on new, unseen data. Here "better" may not mean a more accurate result on every run, but that, given the variation present in the existing data, the model does a better job of encompassing that variation and on average produces more accurate results in the long run, as long as new unseen data falls within what was seen during training.
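A rough sketch of that experiment, reusing your fitted grid_search (the seed values and the use of sklearn.base.clone are my own choices here):
from sklearn.base import clone
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

#refit the winning configuration on several different train/test splits and score each one
for seed in (0, 1, 2, 3, 4):
    X_tr, X_te, y_tr, y_te = train_test_split(
        data['comment'], data['tag'], test_size=0.33, random_state=seed)
    est = clone(grid_search.best_estimator_).fit(X_tr, y_tr)
    print(seed, round(f1_score(y_te, est.predict(X_te), average='macro'), 3))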
If you want more detail on how the candidate estimators performed within each split, take a look at grid_search.cv_results_ for a better picture of what happened step by step through the process.
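For example, something along these lines prints a per-candidate summary (the column names are the documented cv_results_ keys):
import pandas as pd

#one row per parameter combination: mean and spread of the f1_macro score across the folds
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))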
Upvotes: 1