Reputation: 51
I've been experimenting with sklearn's grid search and pipeline functionality and have noticed that the f1_score it returns does not match the f1_score I get with hard-coded parameters. I'm looking for help understanding why this may be.
Data background: a two-column .csv file
customer comment (string), category tag (string)
I'm using the out-of-the-box sklearn bag-of-words approach with no pre-processing of the text, just the CountVectorizer.
Hard-coded model...
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

#get .csv data into dataFrame
data_file = 'comment_data_basic.csv'
data = pd.read_csv(data_file,header=0,quoting=3)
#remove data without 'web issue' or 'product related' tag
data = data.drop(data[(data.tag != 'WEB ISSUES') & (data.tag != 'PRODUCT RELATED')].index)
#split dataFrame into two series
comment_data = data['comment']
tag_data = data['tag']
#split data into test and train samples
comment_train, comment_test, tag_train, tag_test = train_test_split(
    comment_data, tag_data, test_size=0.33)
#build count vectorizer
vectorizer = CountVectorizer(min_df=.002,analyzer='word',stop_words='english',strip_accents='unicode')
vectorizer.fit(comment_data)
#vectorize features and convert to array
comment_train_features = vectorizer.transform(comment_train).toarray()
comment_test_features = vectorizer.transform(comment_test).toarray()
#train LinearSVM Model
lin_svm = LinearSVC()
lin_svm = lin_svm.fit(comment_train_features,tag_train)
#make predictions
lin_svm_predicted_tags = lin_svm.predict(comment_test_features)
#score models
lin_svm_score = round(f1_score(tag_test,lin_svm_predicted_tags,average='macro'),3)
lin_svm_accur = round(accuracy_score(tag_test,lin_svm_predicted_tags),3)
lin_svm_prec = round(precision_score(tag_test,lin_svm_predicted_tags,average='macro'),3)
lin_svm_recall = round(recall_score(tag_test,lin_svm_predicted_tags,average='macro'),3)
#write out scores
print('Model f1Score Accuracy Precision Recall')
print('------ ------- -------- --------- ------')
print('LinSVM {f1:.3f} {ac:.3f} {pr:.3f} {re:.3f} '.format(f1=lin_svm_score,ac=lin_svm_accur,pr=lin_svm_prec,re=lin_svm_recall))
The f1_score output is generally around 0.86 (depending on the random seed value).
Now if I reconstruct basically the same workflow with grid search and a pipeline...
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

#get .csv data into dataFrame
data_file = 'comment_data_basic.csv'
data = pd.read_csv(data_file,header=0,quoting=3)
#remove data without 'web issue' or 'product related' tag
data = data.drop(data[(data.tag != 'WEB ISSUES') & (data.tag != 'PRODUCT RELATED')].index)
#build processing pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LinearSVC()),
])
#define parameters to be used in gridsearch
parameters = {
    #'vect__min_df': (.001,.002,.003,.004,.005),
    'vect__analyzer': ('word',),
    'vect__stop_words': ('english', None),
    'vect__strip_accents': ('unicode',),
    #'clf__C': (1,10,100,1000),
}
if __name__ == '__main__':
    grid_search = GridSearchCV(pipeline, parameters, scoring='f1_macro', n_jobs=1)
    grid_search.fit(data['comment'], data['tag'])
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_params = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_params[param_name]))
The returned f1_score is closer to 0.73, with all of the model parameters the same. My understanding is that grid search applies a cross-validation approach internally, so my guess is that the difference comes from whatever splitting approach it uses compared with my use of train_test_split in the original code. However, a drop from 0.86 -> 0.73 feels large to me, and I would like to be confident in my results.
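For what it's worth, my assumption is that the closer apples-to-apples check on my side would be to cross-validate the hard-coded setup with the same scoring, something like the sketch below (using cross_val_score with cv=3 and scoring='f1_macro' is my guess at what GridSearchCV is doing internally):
#sketch: score the hard-coded vectorizer/classifier with the same f1_macro cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

cv_pipeline = Pipeline([
    ('vect', CountVectorizer(min_df=.002, analyzer='word',
                             stop_words='english', strip_accents='unicode')),
    ('clf', LinearSVC()),
])
#3-fold cross-validation on the full (filtered) data, as the grid search sees it
cv_scores = cross_val_score(cv_pipeline, data['comment'], data['tag'],
                            scoring='f1_macro', cv=3)
print(cv_scores, cv_scores.mean())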
Any insight would be greatly appreciated.
Upvotes: 1
Views: 563
Reputation: 16109
In the code you provided you are not setting the random_state parameter of the LinearSVC model, so even with the same hyperparameters you are unlikely to reproduce an exact duplicate of the best estimator from your GridSearchCV. However, that is a minor issue compared with what is really going on.
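To make your runs repeatable either way, here is a rough illustration of pinning the seeds, reusing the names and imports from your own code (the value 42 is arbitrary):
#hard-coded version: fix both the data split and the model seed
comment_train, comment_test, tag_train, tag_test = train_test_split(
    comment_data, tag_data, test_size=0.33, random_state=42)
lin_svm = LinearSVC(random_state=42)

#pipeline version: the same seed on the classifier inside the pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LinearSVC(random_state=42)),
])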
The grid search is itself cross-validated, in your case over 3 folds of the data. The best_score_ you see is the score of the candidate that performed best on average across all of the folds, scored on each fold's held-out data, and it may not be the estimator with the best score on your particular train/test split. It is possible that, for the single split you used, a different estimator would score higher; but if you generated a handful of different splits and scored the candidates on each of their test sets, on average the best_estimator_ would come out on top.
The idea is that by cross-validating you choose an estimator that is more resilient to variation in the data that is not necessarily represented in a single train/test split, so the more splits you use, the better your model should perform on new, unseen data. Here "better" may not mean a more accurate result on every run, but that, given the variation present in the existing data, the model does a better job of encompassing that variation and on average produces more accurate results in the long run, as long as new unseen data falls within what was seen during training.
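A rough sketch of that experiment, reusing your fitted grid_search (the seed values and the use of sklearn.base.clone are my own choices here):
from sklearn.base import clone
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

#refit the winning configuration on several different train/test splits and score each one
for seed in (0, 1, 2, 3, 4):
    X_tr, X_te, y_tr, y_te = train_test_split(
        data['comment'], data['tag'], test_size=0.33, random_state=seed)
    est = clone(grid_search.best_estimator_).fit(X_tr, y_tr)
    print(seed, round(f1_score(y_te, est.predict(X_te), average='macro'), 3))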
If you want more detail on how the candidate estimators performed within each split, take a look at grid_search.cv_results_ for a better picture of what happened step by step through the process.
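For example, something along these lines prints a per-candidate summary (the column names are the documented cv_results_ keys):
import pandas as pd

#one row per parameter combination: mean and spread of the f1_macro score across the folds
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))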
Upvotes: 1