Reputation: 8562
I have the following dataframe with data:
index field1 field2 field3
1079 COMPUTER long text.... 3
Field1 is a category, field2 is a description, and field3 is just an integer encoding of field1.
I am using the following code to learn the mapping from field2 to categories with sklearn:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['category_id'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)
After I trained the model I can use it to predict a category and it works well. However, I would like to evaluate the model using the test set.
X_test_counts = count_vect.fit_transform(X_test)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
clf.score(X_test_tfidf, y_test)
It throws the following error:
ValueError: dimension mismatch
Is there a way to test the model and get the score or accuracy with such a dataset?
UPDATE: Adding similar transformation to the test set.
Upvotes: 1
Views: 599
Reputation: 16587
The MultinomialNB
classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts, whereas a TF-IDF
transform encodes documents into continuous-valued features. However, in practice, fractional counts such as tf-idf may also work [reference].
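As a quick check of that last point, here is a minimal sketch (with made-up toy documents and labels, not the asker's data) confirming that MultinomialNB fits on fractional tf-idf values without complaint:

```python
# Toy sketch: MultinomialNB accepts fractional tf-idf features,
# not only raw integer counts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap laptop deal", "new gpu benchmark",
        "fresh garden salad", "easy pasta recipe"]
labels = [0, 0, 1, 1]  # 0 = COMPUTER, 1 = FOOD (hypothetical categories)

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)          # continuous-valued features
clf = MultinomialNB().fit(X, labels)   # no error despite non-integer inputs
print(clf.predict(tfidf.transform(["discount gpu laptop"])))
```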
To fix your issue change your code to something like this:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df['Text'].values.tolist())
X_train, X_test, y_train, y_test = train_test_split(X_train_counts, df['category_id'], random_state = 0)
clf = MultinomialNB().fit(X_train, y_train)
clf.predict(X_test)
To enhance your code use Pipeline:
from sklearn.pipeline import Pipeline
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['category_id'], random_state = 0)
text_clf = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB())])
text_clf.fit(X_train, y_train)
text_clf.predict(X_test)
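Since the question asks for a score, note that the fitted pipeline can also be evaluated directly with `score`; it applies the fitted vectorizer and tf-idf transform to the test set internally, so no manual transform calls are needed. A self-contained sketch with made-up stand-ins for `df['Text']` and `df['category_id']`:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical data standing in for df['Text'] / df['category_id'].
X_train = ["cheap laptop deal", "new gpu release",
           "fresh garden salad", "easy pasta recipe"]
y_train = [0, 0, 1, 1]
X_test = ["gpu laptop", "pasta salad"]
y_test = [0, 1]

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(X_train, y_train)
# score() transforms X_test with the fitted steps, then reports mean accuracy.
print(text_clf.score(X_test, y_test))
```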
Upvotes: 3
Reputation: 15568
You should only transform your test data, not fit_transform it. You fit_transform the training data and only transform the test data; otherwise the vectorizer learns a new vocabulary from the test set, so the test matrix has different dimensions than the one the classifier was trained on. So if you remove “fit_” in the calls on the test data, it should work.
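Concretely, here is a self-contained sketch of that fix using the variable names from the question (with made-up stand-ins for the dataframe columns):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical data standing in for df['Text'] / df['category_id'].
X_train = ["cheap laptop deal", "new gpu release",
           "fresh garden salad", "easy pasta recipe"]
y_train = [0, 0, 1, 1]
X_test = ["gpu laptop", "pasta salad"]
y_test = [0, 1]

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)  # fit on training data only
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)

# Test data: transform() only, so the feature dimensions match training.
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
print(clf.score(X_test_tfidf, y_test))
```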
It is better to use a Pipeline, which applies the transformations and then trains/scores/predicts. E.g.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
simple_model = Pipeline(steps = [
('word_vec', CountVectorizer()),
('word_tdf', TfidfTransformer()),
('mnb', MultinomialNB()),
])
simple_model.fit(X_train, y_train)
simple_model.score(X_test, y_test)
This keeps the code simpler and makes it less likely that you accidentally fit_transform your test data.
Upvotes: 2
Reputation: 531
From the code you have provided it looks like you may have forgotten to convert/transform X_test like you did with X_train.
Update:
As for the new error that is now displayed in the question:
ValueError: dimension mismatch
Since the transformer has already been fitted to the training set, you should just call .transform()
on the test set:
tfidf_transformer.transform(X_test_counts)
More info here.
Upvotes: 2