Reputation: 313
I'm trying to solve a clustering problem. I have a list of tf-idf weighted vectors generated by the CountVectorizer() function. This is the data type:
<1000x5369 sparse matrix of type '<type 'numpy.float64'>'
with 42110 stored elements in Compressed Sparse Row format>
I have a "centroid" vector of the following dimension:
<1x5369 sparse matrix of type '<type 'numpy.float64'>'
with 57 stored elements in Compressed Sparse Row format>
When I try to measure the cosine similarity for the centroid and the other vectors in my tfidf_vec_list by the following line of code:
for centroid in centroids:
    sim_scores=[cosine_similarity(vector,centroid) for vector in tfidf_vec_list]
where the similarity function is:
def cosine_similarity(vector1,vector2):
    score=1-scipy.spatial.distance.cosine(vector1,vector2)
    return score
I get the error:
Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    sim_scores=[cosine_similarity(vector,centroid) for vector in tfidf_vec_list]
  File "/home/ashwin/Desktop/Python-2.7.9/programs/test_2.py", line 28, in cosine_similarity
    score=1-scipy.spatial.distance.cosine(vector1,vector2)
  File "/usr/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 287, in cosine
    dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 302, in __mul__
    raise ValueError('dimension mismatch')
I have tried several things, including converting the matrix to an array and each vector to a list, but I get the same error.
Upvotes: 1
Views: 2430
Reputation: 41
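scipy.spatial.distance.cosine does not appear to support sparse matrix inputs. Specifically, np.linalg.norm(sparse_vector) fails (see "Get norm of numpy sparse matrix rows"). The traceback also shows the error being raised from the sparse matrix's __mul__, which suggests that np.dot(u, v) is treated as a matrix product of two 1x5369 matrices, hence the dimension mismatch.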
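If you would rather keep the vectors sparse, one option is to compute the dot product and the norms with sparse operations instead of calling scipy.spatial.distance.cosine. The following is only a minimal sketch: the helper name sparse_cosine_similarity and the small example vectors are made up for illustration, and returning 0.0 for an all-zero vector is an assumption you may want to change.

```python
import numpy as np
from scipy import sparse


def sparse_cosine_similarity(a, b):
    """Cosine similarity of two 1xN sparse row vectors (hypothetical helper)."""
    # Elementwise product followed by a sum gives the dot product
    # without converting either vector to a dense array.
    dot = a.multiply(b).sum()
    # The norms are computed the same way from each vector with itself.
    norm_a = np.sqrt(a.multiply(a).sum())
    norm_b = np.sqrt(b.multiply(b).sum())
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero vector as having similarity 0.
        return 0.0
    return float(dot / (norm_a * norm_b))


# Small made-up vectors standing in for tf-idf rows.
xs = sparse.csr_matrix([[1.0, 0.0, 2.0, 3.0]])
ys = sparse.csr_matrix([[1.0, 0.0, 2.0, 3.0]])
zs = sparse.csr_matrix([[0.0, 5.0, 0.0, 0.0]])

same = sparse_cosine_similarity(xs, ys)        # identical vectors
orthogonal = sparse_cosine_similarity(xs, zs)  # no shared non-zero entries
```

This only touches the stored (non-zero) elements, so it should scale with the number of non-zeros rather than with the 5369 columns.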
If you convert both input vectors (actually here they are row-vectors in matrix form) to dense versions before passing them, it works fine:
>>> xs
<1x4 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
>>> ys
<1x4 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
>>> cosine(xs, ys)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/scipy/spatial/distance.py", line 296, in cosine
    dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
  File "/usr/lib/python3.4/site-packages/scipy/sparse/base.py", line 308, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch
>>> cosine(xs.todense(), ys.todense())
-2.2204460492503131e-16
Densifying should be fine as long as you only convert individual 5369-element vectors one at a time, as opposed to converting the whole matrix at once.
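Applied to the loop from the question, the dense workaround might look like the sketch below. The function name cosine_similarity_dense and the tiny 3x4 matrix are stand-ins invented for this example, not the asker's data. It uses toarray().ravel() rather than todense() so that each vector is a plain 1-D array, since some SciPy versions expect 1-D inputs to cosine.

```python
import numpy as np
from scipy import sparse
from scipy.spatial.distance import cosine


def cosine_similarity_dense(vector1, vector2):
    """Cosine similarity of two 1xN sparse rows via dense 1-D copies."""
    # toarray() returns a 2-D ndarray of shape (1, N); ravel() flattens it.
    u = vector1.toarray().ravel()
    v = vector2.toarray().ravel()
    return 1.0 - cosine(u, v)


# Made-up stand-ins for the 1000x5369 matrix and the 1x5369 centroid.
tfidf_vec_list = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [2.0, 0.0, 4.0, 0.0],
]))
centroid = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0, 0.0]]))

# Index rows explicitly so each item is a 1xN sparse row,
# and only one row is densified at a time.
sim_scores = [
    cosine_similarity_dense(tfidf_vec_list[i], centroid)
    for i in range(tfidf_vec_list.shape[0])
]
```

Note that a row with no non-zero entries has zero norm, so its cosine distance is undefined; you would need to handle that case separately if it can occur in your data.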
Upvotes: 4