How do I resolve a vectorizer dimension mismatch?

Question

I am using TfidfVectorizer to vectorize text, but I get a dimension mismatch when I try to compute cosine_similarity.

My setup looks like this:

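import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# movies is a pandas DataFrame with a 'title' column

def clean_text(text):
    # Keep only letters, digits and spaces
    return re.sub(r'[^a-zA-Z0-9 ]', "", text)

movies['title'] = movies['title'].apply(clean_text)

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')

# Fit the vocabulary on all titles and vectorize them
title_vec = vectorizer.fit_transform(movies['title'])

title = "Toy Story"

title = clean_text(title)

# Vectorize the query with the already fitted vectorizer
word_vec = vectorizer.transform([title])

similarity = cosine_similarity(word_vec, title_vec)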

This raises the following error:

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 172412 while Y.shape[1] == 156967

PS: I have checked the dimensions of word_vec and title_vec, and they differ. I set ngram_range=(1,1) in the vectorizer, but that did not help. I also tried CountVectorizer(), but the issue remains.
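For reference, here is a minimal sketch of one way such a mismatch can arise. It assumes (this is a guess, not something confirmed above) that the vectorizer was refit on different data after title_vec was computed, for example by rerunning a notebook cell. The two small title lists are hypothetical stand-ins for movies['title'].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for movies['title'].
titles_a = ["Toy Story", "Jumanji"]
titles_b = ["Toy Story", "Jumanji", "Heat", "Sabrina"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")

# Same fitted vectorizer for corpus and query: column counts match.
title_vec = vectorizer.fit_transform(titles_a)
query_vec = vectorizer.transform(["Toy Story"])
same_fit_ok = title_vec.shape[1] == query_vec.shape[1]
similarity = cosine_similarity(query_vec, title_vec)

# Assumed scenario: the vectorizer is refit on different data, but the
# old title_vec is still in memory. The vocabulary, and therefore the
# number of columns, changes.
vectorizer.fit(titles_b)
stale_query_vec = vectorizer.transform(["Toy Story"])
widths_differ = title_vec.shape[1] != stale_query_vec.shape[1]

try:
    cosine_similarity(stale_query_vec, title_vec)
    mismatch_raised = False
except ValueError:
    # Same kind of "Incompatible dimension" error as in the question.
    mismatch_raised = True
```

If this is what happened, recomputing title_vec with the currently fitted vectorizer (or rerunning fit_transform and transform together) should bring the column counts back in line.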

I was out of options, and ChatGPT suggested a solution that did not solve the problem:

import numpy as np
from scipy.sparse import hstack

# Pad the smaller matrix with zeros

if word_vec.shape[1] > title_vec.shape[1]:
    diff = word_vec.shape[1] - title_vec.shape[1]
    title_vec = hstack([title_vec, np.zeros((title_vec.shape[0], diff))])
elif title_vec.shape[1] > word_vec.shape[1]:
    diff = title_vec.shape[1] - word_vec.shape[1]
    word_vec = hstack([word_vec, np.zeros((word_vec.shape[0], diff))])

I could not use the code above, but I am including it here to show the extent of the problem.
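As a side note on why zero padding is unlikely to help even when it runs: two separately fitted vocabularies generally assign different column indices to the same term, so matching the widths does not make the columns comparable. The sketch below illustrates this with two hypothetical title lists; the names vec_a and vec_b are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical corpora that share the term "toy".
corpus_a = ["Toy Story", "Jumanji"]
corpus_b = ["Toy Story", "Zorro"]

vec_a = TfidfVectorizer().fit(corpus_a)
vec_b = TfidfVectorizer().fit(corpus_b)

# vocabulary_ maps each term to its column index in the output matrix.
toy_col_a = vec_a.vocabulary_["toy"]
toy_col_b = vec_b.vocabulary_["toy"]

# Both matrices have the same width here, yet "toy" sits in a different
# column in each, so a column-wise comparison would mix up terms.
same_width = len(vec_a.vocabulary_) == len(vec_b.vocabulary_)
same_column = toy_col_a == toy_col_b
```

In this example same_width is true while same_column is false, which is why a cosine similarity computed across the two matrices would not be meaningful.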

