Reputation: 787
I have a fairly simple NLTK and sklearn classifier (I'm a complete noob at this).
I do the usual imports:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
I load the data (I already cleaned it). It is a very simple dataframe with two columns. The first is 'post_clean', which contains the cleaned text; the second is 'uk', which is either True or False.
data = pd.read_pickle('us_uk_posts.pkl')
Then I vectorize with TF-IDF and split the dataset, followed by creating the model:
tf = TfidfVectorizer()
text_tf = tf.fit_transform(data['post_clean'])
X_train, X_test, y_train, y_test = train_test_split(text_tf, data['uk'], test_size=0.3, random_state=123)
clf = MultinomialNB().fit(X_train, y_train)
predicted = clf.predict(X_test)
print("MultinomialNB Accuracy:" , metrics.accuracy_score(y_test,predicted))
Apparently, unless I'm completely missing something here, I have an accuracy of 93%.
My two questions are:
1) How do I now use this model to actually classify some items that don't have a known UK value?
2) How do I test this model using a completely separate test set (that I haven't split)?
I have tried
new_data = pd.read_pickle('new_posts.pkl')
Where the new_posts data is in the same format
new_text_tf = tf.fit_transform(new_data['post_clean'])
predicted = clf.predict(new_X_train)
predicted
and
new_text_tf = tf.fit_transform(new_data['post_clean'])
new_X_train, new_X_test, new_y_train, new_y_test = train_test_split(new_text_tf, new_data['uk'], test_size=1)
predicted = clf.predict(new_text_tf)
predicted
but both return "ValueError: dimension mismatch"
Upvotes: 6
Views: 7306
Reputation: 1882
Once you have extracted the vocabulary and generated the sparse vectors during training with tf.fit_transform(), you need to call tf.transform() on any new data instead of calling fit_transform() again. So the features for the test set should be
new_text_tf = tf.transform(new_data['post_clean'])
When you use tf.fit_transform() on your test / new data, it extracts a new vocabulary based on the words in that data, which is likely to differ from the vocabulary of your training data. A different vocabulary means a different number of feature columns than the classifier was trained on, and that is what raises the dimension mismatch error.
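A minimal sketch of the difference, using a tiny made-up corpus in place of the pickled dataframes. The texts, labels, and the names train_posts, train_uk, new_posts, and X_wrong are assumptions for illustration only; they are not from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for data['post_clean'] and data['uk']
train_posts = [
    "i love the colour of my neighbour's lorry",
    "my favourite flavour is in the queue",
    "the color of the truck in the parking lot",
    "my favorite flavor is at the gas station",
]
train_uk = [True, True, False, False]

# Hypothetical stand-in for new_data['post_clean'] (no labels needed)
new_posts = ["what colour is the lorry", "that truck needs gas"]

tf = TfidfVectorizer()
X_train = tf.fit_transform(train_posts)  # learns the vocabulary and IDF weights
clf = MultinomialNB().fit(X_train, train_uk)

# Reuse the already fitted vectorizer: same columns as the training matrix
X_new = tf.transform(new_posts)
predicted = clf.predict(X_new)

# A separately fitted vectorizer builds its own vocabulary, so its column
# count no longer matches what the classifier was trained on
X_wrong = TfidfVectorizer().fit_transform(new_posts)

print("train columns:", X_train.shape[1])
print("transform() columns:", X_new.shape[1])
print("fit_transform() columns:", X_wrong.shape[1])
print(predicted)
```

Words in the new posts that the fitted vectorizer has never seen are simply ignored by transform(), which is why the column count stays the same.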
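You could also combine your test data and training data into one master set and run fit_transform() on that, so that words appearing only in the test set are captured by the vectorizer; the rest of your code can stay the same. Be aware, though, that this lets information from the test set (its vocabulary and IDF weights) leak into the features, so the accuracy you measure afterwards will tend to be optimistic. It also cannot be done for genuinely new data that arrives after training. If you want an honest estimate on a completely separate test set, fit the vectorizer on the training data only and use transform() on everything else.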
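For the second question (scoring on a completely separate labelled set), an alternative that leaves that set untouched is to wrap the vectorizer and classifier in a scikit-learn Pipeline, which is not what the code above uses. This is only a rough sketch: the data and the names train_posts, heldout_posts, heldout_uk, and model are made up for illustration.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data (stand-in for us_uk_posts.pkl)
train_posts = [
    "i love the colour of my neighbour's lorry",
    "my favourite flavour is in the queue",
    "the color of the truck in the parking lot",
    "my favorite flavor is at the gas station",
]
train_uk = [True, True, False, False]

# Hypothetical separate labelled set (stand-in for new_posts.pkl)
heldout_posts = ["the lorry is in the queue", "the truck is at the gas station"]
heldout_uk = [True, False]

# The pipeline calls fit_transform() on the vectorizer during fit()
# and only transform() during predict(), so the two cannot be mixed up
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_posts, train_uk)

heldout_pred = model.predict(heldout_posts)
accuracy = metrics.accuracy_score(heldout_uk, heldout_pred)
print("Held-out accuracy:", accuracy)
```

With the real data, model.fit() would take data['post_clean'] and data['uk'], and model.predict() would take new_data['post_clean'] directly as raw text; no train_test_split is needed on the separate set.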
Upvotes: 4