Charlie Morton

Reputation: 787

How to use sklearn TfidfVectorizer on new data

I have a fairly simple NLTK and sklearn classifier (I'm a complete noob at this).

I do the usual imports

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer

I load the data (I already cleaned it). It is a very simple dataframe with two columns: 'post_clean', which contains the cleaned text, and 'uk', which is either True or False.

data = pd.read_pickle('us_uk_posts.pkl')

Then I vectorize with TF-IDF, split the dataset, and create the model:

tf = TfidfVectorizer()
text_tf = tf.fit_transform(data['post_clean'])
X_train, X_test, y_train, y_test = train_test_split(text_tf, data['uk'], test_size=0.3, random_state=123)


clf = MultinomialNB().fit(X_train, y_train)
predicted = clf.predict(X_test)
print("MultinomialNB Accuracy:" , metrics.accuracy_score(y_test,predicted))

Apparently, unless I'm completely missing something here, I have an accuracy of 93%.

My two questions are:

1) How do I now use this model to actually classify some items that don't have a known UK value?

2) How do I test this model using a completely separate test set (that I haven't split)?

I have tried

new_data = pd.read_pickle('new_posts.pkl')

Where the new_posts data is in the same format

new_text_tf = tf.fit_transform(new_data['post_clean'])

predicted = clf.predict(new_X_train)
predicted

and

new_text_tf = tf.fit_transform(new_data['post_clean'])

new_X_train, new_X_test, new_y_train, new_y_test = train_test_split(new_text_tf, new_data['uk'], test_size=1)

predicted = clf.predict(new_text_tf)
predicted

but both return "ValueError: dimension mismatch"

Upvotes: 6

Views: 7306

Answers (1)

Adnan S

Reputation: 1882

Once the vectorizer has learned its vocabulary from the training text via tf.fit_transform(), you need to call tf.transform(), not fit_transform(), on any later data. So the features for the test set or new data should be:

new_text_tf = tf.transform(new_data['post_clean'])
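
As a minimal sketch of the whole flow for the first question (classifying posts with no known label), the snippet below fits the vectorizer once and only transforms the new text. The toy sentences and the names train_texts, train_uk and new_texts are made up for illustration; they stand in for data['post_clean'], data['uk'] and new_data['post_clean'].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for data['post_clean'] and data['uk'] (made-up examples)
train_texts = [
    "colour of the lorry",
    "favourite flavour of crisps",
    "color of the truck",
    "favorite flavor of chips",
]
train_uk = [True, True, False, False]

# Fit the vectorizer ONCE, on the training text only
tf = TfidfVectorizer()
X_train = tf.fit_transform(train_texts)  # learns vocabulary and IDF weights
clf = MultinomialNB().fit(X_train, train_uk)

# New, unlabelled posts: reuse the fitted vectorizer and only transform
new_texts = ["the lorry has a nice colour", "the truck has a nice color"]
X_new = tf.transform(new_texts)  # same columns as X_train
predicted = clf.predict(X_new)
print(predicted)
```

Words in the new posts that were never seen during fitting (here "has" and "nice") are simply ignored by transform(), so the column count always matches what the classifier expects.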

When you call tf.fit_transform() on your test or new data, the vectorizer discards what it learned and builds a new vocabulary from the words in that data, which will likely differ from the training vocabulary. The resulting matrix then has a different number of columns from the one the classifier was trained on, and that is what triggers the dimension mismatch error.
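
The mismatch is easy to see by comparing matrix shapes. This is only an illustrative sketch with made-up sentences; a second vectorizer (refit, a hypothetical name) is used for the "wrong" call so the fitted tf is not overwritten.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training and new text, just to compare feature counts
train_texts = ["colour of the lorry", "color of the truck"]
new_texts = ["a lorry full of crisps and biscuits near the motorway"]

tf = TfidfVectorizer()
X_train = tf.fit_transform(train_texts)
print(X_train.shape)    # (2, 6): six distinct training words

# Correct: transform() keeps the six training columns
X_new_ok = tf.transform(new_texts)
print(X_new_ok.shape)   # (1, 6)

# Incorrect: fitting again builds a vocabulary from the new text alone
refit = TfidfVectorizer()
X_new_bad = refit.fit_transform(new_texts)
print(X_new_bad.shape)  # (1, 9): nine distinct words, so the columns no longer line up
```

A classifier trained on X_train expects six features, so passing X_new_bad to clf.predict() fails, while X_new_ok works.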

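You could also combine your test data and training data into one master set and run fit_transform() on it, so that words that appear only in the test set are captured in your vectorizer. The rest of your code can stay the same. Be aware of the trade-off, though: the test text then influences the vocabulary and IDF weights, so the accuracy you measure is no longer a clean held-out estimate, and the classifier still has no training evidence for words it only saw in the test set. It is also not an option for genuinely new data that arrives after training, where transform() is the only choice.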
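
For the second question (scoring against a completely separate, labelled test set), one option is to bundle the vectorizer and classifier so the fit/transform distinction is handled for you. Using sklearn's make_pipeline here is my suggestion rather than something from the original code, and the toy sentences and variable names are made up; test_texts and test_uk stand in for new_data['post_clean'] and new_data['uk'].

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (made-up examples)
train_texts = [
    "colour of the lorry",
    "favourite flavour of crisps",
    "color of the truck",
    "favorite flavor of chips",
]
train_uk = [True, True, False, False]

# Separate labelled set, never seen during fitting
test_texts = ["my favourite colour", "my favorite color"]
test_uk = [True, False]

# fit() calls fit_transform() on the vectorizer, then fits the classifier;
# predict() calls only transform() before predicting
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_uk)

predicted = model.predict(test_texts)
accuracy = metrics.accuracy_score(test_uk, predicted)
print("MultinomialNB Accuracy:", accuracy)
```

No train_test_split is needed on the separate set: all of it is passed to predict() and compared with its known labels.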

Upvotes: 4
