Sklearn cosine_similarity between a tfidf vector and an array of tfidf vectors

Question

I'm trying to compute the cosine similarity between one text and each of the texts stored in an array.

I have been working with this code:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text1 = 'Hola me llamo Luis'
text2 = 'Ayer Juan se compró una casa'
text3 = 'Casiguagua está más gordo que un manatí'
text4 = 'Y encima le huelen los pies'
text5 = 'HOlA ME LLAMO PEPE'

tweets = [text1, text2, text3, text4]

vectorizer = TfidfVectorizer(max_features=10000)
vectorizer.fit(tweets)

text1_vector = vectorizer.transform([text1])
text2_vector = vectorizer.transform([text2])
text3_vector = vectorizer.transform([text3])
text4_vector = vectorizer.transform([text4])
text5_vector = vectorizer.transform([text5])

buffer = []

buffer.append(text1_vector)
buffer.append(text2_vector)
buffer.append(text3_vector)
buffer.append(text4_vector)

similarity = cosine_similarity(text5_vector.reshape(1,-1), buffer)

My vectors are of type:

scipy.sparse.csr.csr_matrix

So I guess I will have to convert my buffer to a csr_matrix, but I don't know how to do this.

I have also tried initializing my buffer as an np.array([]) object, but then I can't manage to add the vectors to it afterwards. Any idea what I am doing wrong?

Franco Piccolo · Accepted Answer

You can't append sparse rows to a NumPy array. What you can do instead is convert each vector to a dense array with toarray and stack the results with np.vstack:

buffer = np.vstack([text1_vector.toarray(),
                    text2_vector.toarray(),
                    text3_vector.toarray(),
                    text4_vector.toarray()])

similarity = cosine_similarity(text5_vector.toarray(), buffer)
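If converting to dense arrays is a concern (for example with a large vocabulary), the vectors can probably stay sparse. The sketch below is one possible variant, not part of the original answer: it assumes cosine_similarity accepts sparse input, and it uses scipy.sparse.vstack to combine the per-text rows. The names tweet_matrix, query_vector, rows, and stacked are made up for illustration.

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    'Hola me llamo Luis',
    'Ayer Juan se compró una casa',
    'Casiguagua está más gordo que un manatí',
    'Y encima le huelen los pies',
]
query = 'HOlA ME LLAMO PEPE'

vectorizer = TfidfVectorizer(max_features=10000)

# fit_transform returns a single sparse matrix with one row per tweet,
# so no manual buffer is needed.
tweet_matrix = vectorizer.fit_transform(tweets)
query_vector = vectorizer.transform([query])

# Sparse inputs are passed directly; the result is a dense (1, 4) array.
similarity = cosine_similarity(query_vector, tweet_matrix)

# If the rows already exist as separate 1 x n sparse matrices,
# scipy.sparse.vstack stacks them without densifying.
rows = [vectorizer.transform([t]) for t in tweets]
stacked = vstack(rows)
similarity_stacked = cosine_similarity(query_vector, stacked)
```

Both calls should give the same scores. The first form avoids building the buffer at all, while the second is closer to the original code where each vector is created separately.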
