user12261788

Reputation: 45

Explanation of "Dimension mismatch" after using fit_transform on testing data

I was reading some code about NLP and saw that X_test does not have fit_transform when assigned (last line of code below).

When I tried to do it with fit_transform, like the X_train line, and then continued to use a predictive model, it returned:

ValueError: dimension mismatch

This question is about that case: SciPy and scikit-learn - ValueError: Dimension mismatch

What I would like is a simple explanation of why it occurs, because I don't understand it clearly.

Below is the code I have:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

categories = ['alt.atheism', 'comp.graphics']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories,  
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,  
                                     remove=('headers', 'footers', 'quotes'))
y_train = newsgroups_train.target
y_test = newsgroups_test.target
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)  # using fit_transform here instead causes the error
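
For context, here is roughly what the next step looks like (this continuation is not in the original post; MultinomialNB is an assumption standing in for whatever predictive model was actually used):

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train, y_train)                 # the model learns one weight per X_train column
y_pred = clf.predict(X_test)              # works: X_test has the same columns as X_train
print(accuracy_score(y_test, y_pred))

X_test_refit = TfidfVectorizer().fit_transform(newsgroups_test.data)
# clf.predict(X_test_refit)               # ValueError: dimension mismatch
#                                         # (newer scikit-learn versions report a feature-count error instead)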

Upvotes: 1

Views: 2109

Answers (1)

Kidae Kim

Reputation: 499

When you use TfidfVectorizer().fit_transform(), it first builds the vocabulary of unique terms (features) in your data and then computes their weighted frequencies. Your training and test data do not contain the same set of unique terms, so if you call .fit_transform() on each of them separately, X_train and X_test end up with different numbers of columns. Your predictive model, which was fitted on X_train's columns, then raises the dimension mismatch error.

If you call .fit_transform() on X_train and then just .transform() on X_test, only the vocabulary learned from the training data is used. Terms that appear only in the test data are ignored, so the two matrices end up with the same number of features.
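
As a quick check, a sketch of the shape comparison (assuming the code from the question above has already been run; this comparison is not part of the original answer):

from sklearn.feature_extraction.text import TfidfVectorizer

X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test_ok = vectorizer.transform(newsgroups_test.data)                # same columns as X_train
X_test_refit = TfidfVectorizer().fit_transform(newsgroups_test.data)  # fitted again, on the test split only

print(X_train.shape[1], X_test_ok.shape[1])   # equal: both use the training vocabulary
print(X_test_refit.shape[1])                  # different: the test split has its own vocabulary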

EDIT: I have written a small example.

from sklearn.feature_extraction.text import TfidfVectorizer

city = ['London Moscow Washington',
        'Washington Boston']

president = ['Adams Washington',
             'Jefferson']

vectorizer = TfidfVectorizer()

First, .fit_transform(city).

X_city = vectorizer.fit_transform(city)
X_city.toarray()

>>>array([[0.        , 0.6316672 , 0.6316672 , 0.44943642],
          [0.81480247, 0.        , 0.        , 0.57973867]])

Then, .transform(president) based on the fit above.

vectorizer.transform(president).toarray()

>>>array([[0., 0., 0., 1.],
          [0., 0., 0., 0.]])

Finally, .fit_transform(president).

X_president = vectorizer.fit_transform(president)
X_president.toarray()

>>>array([[0.70710678, 0.        , 0.70710678],
          [0.        , 1.        , 0.        ]])

It comes down to keeping the dimensions of the train and test data matched for your model.
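
To see the mismatch end to end, here is a self-contained sketch (the MultinomialNB classifier and the toy labels are assumptions added for illustration, not part of the original answer):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

city = ['London Moscow Washington', 'Washington Boston']
president = ['Adams Washington', 'Jefferson']
y_city = [0, 1]                                  # made-up labels for the two training documents

vec = TfidfVectorizer()
X_city = vec.fit_transform(city)                 # 2 x 4: boston, london, moscow, washington
clf = MultinomialNB().fit(X_city, y_city)        # the classifier now expects 4 features

clf.predict(vec.transform(president))            # OK: 2 x 4, same columns as the training data
# clf.predict(TfidfVectorizer().fit_transform(president))
#   raises ValueError: dimension mismatch (a 2 x 3 matrix against the 4 features the model learned;
#   newer scikit-learn versions report a feature-count error instead)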

Upvotes: 1
