Vic13

Reputation: 561

ValueError: X has 1709 features per sample; expecting 2444

I am using this code:

import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
import re

Using TF-IDF vectorization:

from sklearn.feature_extraction.text import TfidfVectorizer
tv=TfidfVectorizer(max_df=0.5,min_df=2,stop_words='english')

Loading the data files:

df=pd.read_json('train.json',orient='columns')
test_df=pd.read_json('test.json',orient='columns')

df['seperated_ingredients'] = df['ingredients'].apply(','.join)
test_df['seperated_ingredients'] = test_df['ingredients'].apply(','.join)

df['seperated_ingredients']=df['seperated_ingredients'].str.lower()
test_df['seperated_ingredients']=test_df['seperated_ingredients'].str.lower()

cuisines={'thai':0,'vietnamese':1,'spanish':2,'southern_us':3,'russian':4,'moroccan':5,'mexican':6,'korean':7,'japanese':8,'jamaican':9,'italian':10,'irish':11,'indian':12,'greek':13,'french':14,'filipino':15,'chinese':16,'cajun_creole':17,'british':18,'brazilian':19 }
df.cuisine= [cuisines[item] for item in df.cuisine]

Doing preprocessing:

ho=df['seperated_ingredients']
ho=ho.replace(r'#([^\s]+)', r'\1', regex=True)
ho=ho.replace('\'"',regex=True)

ho=tv.fit_transform(ho)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(ho,df['cuisine'],random_state=0)


from sklearn.linear_model import LogisticRegression
clf= LogisticRegression(penalty='l1')
clf.fit(X_train, y_train)
clf.score(X_test,y_test)

from sklearn.linear_model import LogisticRegression
clf1= LogisticRegression(penalty='l1')
clf1.fit(ho,df['cuisine'])

hs=test_df['seperated_ingredients']

hs=hs.replace(r'#([^\s]+)', r'\1', regex=True)
hs=hs.replace('\'"',regex=True)
hs=tv.fit_transform(hs)

ss=clf1.predict(hs) # this line is giving error.

I am getting the above-mentioned error while predicting. Does anyone know what I am doing wrong?

Upvotes: 3

Views: 8340

Answers (1)

Mikhail Stepanov

Reputation: 3790

You shouldn't refit the tf-idf vectorizer; use the same vectorizer, with its already-fitted vocabulary, to encode the test data. Here are the method descriptions from the docs:

fit_transform(raw_documents, y=None)
  Learn vocabulary and idf, return term-document matrix.
  This is equivalent to fit followed by transform, but more efficiently implemented.

transform(raw_documents, copy=True)
  Transform documents to document-term matrix.
  Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).
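
To make the difference concrete, here is a tiny sketch on made-up documents (the strings below are invented purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

docs_train = ["garlic onion tomato", "onion pepper rice"]  # toy training texts
docs_test  = ["tomato rice", "pepper garlic"]              # toy test texts

vec = TfidfVectorizer()
X_train = vec.fit_transform(docs_train)  # learns the vocabulary from the training texts
X_test  = vec.transform(docs_test)       # reuses that vocabulary, no refitting
print(X_train.shape[1] == X_test.shape[1])  # True: same number of feature columns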

You got ValueError: X has 1709 features per sample; expecting 2444 because the vectorizer was refitted on the test data and a new vocabulary was created, so the test data was encoded into an array of a different shape. Just check the size of the vocabulary before and after the second fit_transform with print(len(tv.vocabulary_)). Also, the tf-idf vocabulary was probably reordered during the refitting.
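
For example, a quick diagnostic sketch (assuming df and test_df are loaded and preprocessed as in the question):

from sklearn.feature_extraction.text import TfidfVectorizer

tv_train = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
tv_train.fit(df['seperated_ingredients'])
print(len(tv_train.vocabulary_))  # terms learned from the training data, e.g. 2444

tv_test = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
tv_test.fit(test_df['seperated_ingredients'])
print(len(tv_test.vocabulary_))   # a different vocabulary, e.g. 1709

So keep fitting the vectorizer on the training data only, as before: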

ho=df['seperated_ingredients']
ho=ho.replace(r'#([^\s]+)', r'\1', regex=True)
ho=ho.replace('\'"',regex=True)
ho=tv.fit_transform(ho)

Then use the already fitted tf-idf vectorizer to encode the test data with the transform method:

hs=test_df['seperated_ingredients']
hs=hs.replace(r'#([^\s]+)', r'\1', regex=True)
hs=hs.replace('\'"',regex=True)
hs=tv.transform(hs)

The transformation is carried out with the same vocabulary, so the output array has the correct shape.
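
As a quick sanity check (a sketch, reusing the variables above), the train and test matrices should now have the same number of columns, and prediction goes through:

print(ho.shape[1], hs.shape[1])  # second dimensions match now
ss = clf1.predict(hs)            # no more feature-count mismatch

As a side note, one way to avoid this kind of mismatch altogether is to bundle the vectorizer and the classifier in a scikit-learn Pipeline, so the fitted vocabulary is reused automatically at predict time. A minimal sketch (solver='liblinear' is an assumption for scikit-learn versions where penalty='l1' requires an explicit solver):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(
    TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english'),
    LogisticRegression(penalty='l1', solver='liblinear'),
)
pipe.fit(df['seperated_ingredients'], df['cuisine'])  # fits the vectorizer and the model together
ss = pipe.predict(test_df['seperated_ingredients'])   # transforms with the fitted vocabulary, then predicts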

Upvotes: 7
