Istvan

Reputation: 8562

How to evaluate text based models with scikit-learn?

I have the following dataframe with data:

index   field1      field2            field3
1079    COMPUTER    long text....     3

field1 is a category, field2 is a description, and field3 is just an integer representation of field1.

I am using the following code to learn field2 to category mappings with sklearn:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['category_id'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)

After I trained the model I can use it to predict a category and it works well. However, I would like to evaluate the model using the test set.

X_test_counts = count_vect.fit_transform(X_test)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
clf.score(X_test_tfidf, y_test) 

It throws the following error:

ValueError: dimension mismatch

Is there a way to test the model and get the score or accuracy with such a dataset?

UPDATE: added a similar transformation for the test set.

Upvotes: 1

Views: 599

Answers (3)

Amir

Reputation: 16587

The MultinomialNB classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts, whereas a tf-idf transform encodes documents as continuous-valued features. In practice, however, fractional counts such as tf-idf may also work [reference].
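As a quick sanity check that MultinomialNB accepts fractional tf-idf features, a minimal sketch (the toy corpus and labels are illustrative, not from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 0 = hardware-related, 1 = software-related
texts = ["computer keyboard mouse", "monitor screen display",
         "python code function", "compile program source"]
labels = [0, 0, 1, 1]

# TfidfVectorizer produces fractional (float) features,
# yet MultinomialNB fits them without a type error
X = TfidfVectorizer().fit_transform(texts)
clf = MultinomialNB().fit(X, labels)
print(clf.predict(X))
```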

To fix your issue, change your code to something like this:

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df['Text'].values.tolist())
X_train, X_test, y_train, y_test = train_test_split(X_train_counts, df['category_id'], random_state = 0)

clf = MultinomialNB().fit(X_train, y_train)
clf.predict(X_test)

To streamline your code, use a Pipeline:

from sklearn.pipeline import Pipeline
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['category_id'], random_state = 0)
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB())])
text_clf.fit(X_train, y_train)
text_clf.predict(X_test)
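Once fitted, the pipeline also gives you the accuracy the question asks for via score(), which applies the same transformations to the raw test text before evaluating. A minimal sketch, with an illustrative corpus standing in for df['Text'] / df['category_id']:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Illustrative stand-in for df['Text'] / df['category_id']
texts = ["computer keyboard mouse", "monitor screen display",
         "python code function", "compile program source"]
labels = [0, 0, 1, 1]

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(texts, labels)

# score() vectorizes the raw text and returns mean accuracy
print(text_clf.score(texts, labels))  # 1.0 on this trivially separable corpus
```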

Upvotes: 3

Prayson W. Daniel

Reputation: 15568

You should only transform your test data, not fit_transform it. You fit_transform the training data and only transform the test data, so removing the fit_ on the test step should make it work.
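Applied end to end, the fix looks like this (a sketch; the corpus here is an illustrative stand-in for df['Text'] and df['category_id']):

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Illustrative corpus standing in for df['Text'] / df['category_id']
texts = ["computer keyboard mouse", "monitor screen display",
         "keyboard mouse screen", "python code function",
         "compile program source", "python source program"]
labels = [0, 0, 0, 1, 1, 1]
X_train, X_test, y_train, y_test = train_test_split(texts, labels, random_state=0)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)           # fit_transform on train
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)

# Test set: transform() only, so the feature space matches training
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
print(clf.score(X_test_tfidf, y_test))
```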

It is better to use a pipeline, which applies the transformations and then trains/scores/predicts. E.g.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['category_id'], random_state=0)

model = Pipeline(steps=[
            ('word_vec', CountVectorizer()),
            ('word_tfidf', TfidfTransformer()),
            ('mnb', MultinomialNB()),
        ])

model.fit(X_train, y_train)
model.score(X_test, y_test)

This keeps your code simpler and makes it less likely that you fit_transform your test data by accident.

Upvotes: 2

runcoderun

Reputation: 531

From the code you have provided, it looks like you may have forgotten to transform X_test the way you did X_train.

Update:
As for the new error that is now displayed in the question:

ValueError: dimension mismatch

Since the vectorizer and transformer have already been fitted to the training set, you should just call .transform() on the test set:

tfidf_transformer.transform(X_test_counts)

More info here.

Upvotes: 2
