I am using TfidfVectorizer as my text vectorizer, but I get a dimension mismatch when I try to compute cosine_similarity.
My code looks like this:
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean_text(text):
    # Keep only letters, digits, and spaces
    return re.sub(r'[^a-zA-Z0-9 ]', "", text)

movies['title'] = movies['title'].apply(clean_text)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
title_vec = vectorizer.fit_transform(movies['title'])

title = "Toy Story"
title = clean_text(title)
word_vec = vectorizer.transform([title])
similarity = cosine_similarity(word_vec, title_vec)
This results in the following error message:
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 172412 while Y.shape[1] == 156967
PS: I have checked the len of word_vec and title_vec, and they show differing lengths.
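A minimal sketch of that check, for anyone trying to reproduce this. The `titles` list is a made-up stand-in for movies['title'], not the real data, and the sketch compares `.shape` rather than `len`, since the column count is what cosine_similarity compares. When both matrices come from the same fitted vectorizer object, their column counts should both equal the size of `vectorizer.vocabulary_`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for movies['title'] (assumption, not the real dataset)
titles = ["Toy Story", "Jumanji", "Grumpier Old Men", "Toy Story 2"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
title_vec = vectorizer.fit_transform(titles)     # learns the vocabulary
word_vec = vectorizer.transform(["Toy Story"])   # reuses that vocabulary

# Number of features the fitted vectorizer knows about
n_features = len(vectorizer.vocabulary_)

# Both matrices should report n_features columns
print(title_vec.shape, word_vec.shape, n_features)

similarity = cosine_similarity(word_vec, title_vec)
```

If the two column counts differ in a real session, one possible explanation (an assumption, since it depends on state not shown here) is that the vectorizer was refit, or the titles changed, between creating title_vec and word_vec.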
I set ngram_range=(1,1) in the vectorizer, but that did not help.
I also tried CountVectorizer(), but the issue remains.
I was out of options, and ChatGPT provided a solution that didn't solve the problem:
import numpy as np
from scipy.sparse import hstack

# Pad smaller matrix with zeros
if word_vec.shape[1] > title_vec.shape[1]:
    diff = word_vec.shape[1] - title_vec.shape[1]
    title_vec = hstack([title_vec, np.zeros((title_vec.shape[0], diff))])
elif title_vec.shape[1] > word_vec.shape[1]:
    diff = title_vec.shape[1] - word_vec.shape[1]
    word_vec = hstack([word_vec, np.zeros((word_vec.shape[0], diff))])
I could not use the code above, but I am including it here to show the extent of the problem.
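One possible reason padding cannot help, sketched under the assumption that the two matrices came from two different fits (the post does not confirm this): each fit assigns its own column index to each term, so equal widths would still not mean matching columns. The two tiny corpora below are made up purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical fits on different made-up corpora
vec_a = TfidfVectorizer().fit(["toy story", "jumanji"])
vec_b = TfidfVectorizer().fit(["heat", "toy story", "casino"])

# Column index that each fitted vocabulary assigns to the same term
col_a = vec_a.vocabulary_["toy"]
col_b = vec_b.vocabulary_["toy"]

# Vocabulary sizes, i.e. the widths of matrices produced by each vectorizer
width_a = len(vec_a.vocabulary_)
width_b = len(vec_b.vocabulary_)

print(col_a, col_b, width_a, width_b)
```

Padding the narrower matrix with zero columns would make the shapes agree, but "toy" would still sit in a different column in each, so the resulting similarities would not be meaningful.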