Reputation: 1037
I'm trying to compute the cosine similarity between one text and each of the texts contained in an array.
I have been working with this code:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
text1 = 'Hola me llamo Luis'
text2 = 'Ayer Juan se compró una casa'
text3 = 'Casiguagua está más gordo que un manatí'
text4 = 'Y encima le huelen los pies'
text5 = 'HOlA ME LLAMO PEPE'
tweets = [text1, text2, text3, text4]
vectorizer = TfidfVectorizer(max_features=10000)
vectorizer.fit(tweets)
text1_vector = vectorizer.transform([text1])
text2_vector = vectorizer.transform([text2])
text3_vector = vectorizer.transform([text3])
text4_vector = vectorizer.transform([text4])
text5_vector = vectorizer.transform([text5])
buffer = []
buffer.append(text1_vector)
buffer.append(text2_vector)
buffer.append(text3_vector)
buffer.append(text4_vector)
similarity = cosine_similarity(text5_vector.reshape(1,-1), buffer)
My vectors are of type:
scipy.sparse.csr.csr_matrix
So I guess I will have to convert my buffer to a csr_matrix, but I don't know how to do this.
I have also tried initializing my buffer as a np.array([]) object, but then I can't manage to add the vectors to the buffer afterwards. Any idea what I am doing wrong?
Upvotes: 1
Views: 622
Reputation: 7410
You can't append sparse rows to a numpy array. What you can do is convert each row to a dense numpy array with toarray and stack them with vstack, like this:
buffer = np.vstack([text1_vector.toarray(),
                    text2_vector.toarray(),
                    text3_vector.toarray(),
                    text4_vector.toarray()])
similarity = cosine_similarity(text5_vector.toarray(), buffer)
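If you would rather keep everything sparse, which is what the question was originally aiming for, a possible alternative is to stack the rows with scipy.sparse.vstack instead of converting them to dense arrays. The following is only a minimal sketch under the assumption that the vectorizer is fitted on the same four tweets as in the question; the names rows and tweets_matrix are introduced here for illustration and are not part of the original code. It also shows that transforming the whole list in one call gives the same matrix without any manual stacking.

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = ['Hola me llamo Luis',
          'Ayer Juan se compró una casa',
          'Casiguagua está más gordo que un manatí',
          'Y encima le huelen los pies']
text5 = 'HOlA ME LLAMO PEPE'

vectorizer = TfidfVectorizer(max_features=10000)
vectorizer.fit(tweets)

# Option 1: stack the individual 1 x n_features sparse rows
# into a single sparse matrix (no dense conversion needed).
rows = [vectorizer.transform([t]) for t in tweets]  # hypothetical helper list
buffer = vstack(rows, format='csr')

# Option 2: transform the whole list at once; the result is
# already a sparse matrix with one row per tweet.
tweets_matrix = vectorizer.transform(tweets)

text5_vector = vectorizer.transform([text5])

# cosine_similarity accepts sparse input directly.
similarity = cosine_similarity(text5_vector, buffer)
print(similarity.shape)  # one row, one column per tweet
```

Staying sparse mostly matters when the vocabulary or the number of texts is large, since toarray allocates a full dense row for every text.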
Upvotes: 1