Reputation: 61
I ran across an example on parameters tuning with Grid search and text data using TfidfVectorizer() in the pipeline. As far as I've understood is that when we call grid_search.fit(X_train, y_train) it will transform the data then fit the model as it is described in a dictionary. However during the evaluation, I'm a bit confused with the test dataset, since when we call grid_search.predict(X_test) I don't know whether/(how) the TfidfVectorizer() is applied on this test chunk.
Thanks
David
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model.logistic import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_
score
pipeline = Pipeline([
('vect', TfidfVectorizer(stop_words='english')),
('clf', LogisticRegression())
])
parameters = {
'vect__max_df': (0.25, 0.5, 0.75),
'vect__stop_words': ('english', None),
'vect__max_features': (2500, 5000, 10000, None),
'vect__ngram_range': ((1, 1), (1, 2)),
'vect__use_idf': (True, False),
'vect__norm': ('l1', 'l2'),
'clf__penalty': ('l1', 'l2'),
'clf__C': (0.01, 0.1, 1, 10),
}
if __name__ == "__main__":
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,
verbose=1, scoring='accuracy', cv=3)
df = pd.read_csv('data/sms.csv')
X, y, = df['message'], df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y)
grid_search.fit(X_train, y_train)
print 'Best score: %0.3f' % grid_search.best_score_
print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
print '\t%s: %r' % (param_name, best_parameters[param_name])
predictions = grid_search.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, predictions)
print 'Precision:', precision_score(y_test, predictions)
print 'Recall:', recall_score(y_test, predictions)
Upvotes: 0
Views: 71
Reputation: 349
This is an example of scikit-learn pipelines magic. It works like this:
Pipeline
constructor - all data, whether on train or test (predict
) stage, will be processed through all the defined steps - in this case by TfidfVectorizer
and then the output will be passed to LogisticRegression
model.GridSearchCV
constructor allows you to use the method fit
, that not only performs grid search but also internally sets both TfidfVectorizer
and LogisticRegression
to best found parameters, so later running predict
does so on best-found models. You can find more info on creating pipelines in scikit-learn documentation.
Upvotes: 1