ashwin shanker

Reputation: 313

Numpy matrix dimensions-tfidf vector

I'm trying to solve a clustering problem. I have a list of tf-idf weighted vectors generated by the CountVectorizer() function. This is the data type:

<1000x5369 sparse matrix of type '<type 'numpy.float64'>'
with 42110 stored elements in Compressed Sparse Row format>

I have a "centroid" vector of the following dimension:

<1x5369 sparse matrix of type '<type 'numpy.float64'>'
with 57 stored elements in Compressed Sparse Row format>

When I try to measure the cosine similarity between each centroid and the vectors in my tfidf_vec_list with the following code:

for centroid in centroids:
    sim_scores=[cosine_similarity(vector,centroid) for vector in tfidf_vec_list]

where the similarity function is:

import scipy.spatial.distance

def cosine_similarity(vector1,vector2):
    # similarity = 1 - cosine distance
    score=1-scipy.spatial.distance.cosine(vector1,vector2)
    return score

I get the error:

Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    sim_scores=[cosine_similarity(vector,centroid) for vector in tfidf_vec_list]
  File "/home/ashwin/Desktop/Python-2.7.9/programs/test_2.py", line 28, in cosine_similarity
    score=1-scipy.spatial.distance.cosine(vector1,vector2)
  File "/usr/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 287, in cosine
    dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 302, in __mul__
    raise ValueError('dimension mismatch')

I have tried everything, including converting the matrix to an array and each vector to a list, but I get the same error!

Upvotes: 1

Views: 2430

Answers (1)

HapeMask

Reputation: 41

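scipy.spatial.distance.cosine appears not to support sparse matrix inputs. Specifically, np.linalg.norm(sparse_vector) fails (see "Get norm of numpy sparse matrix rows"). The traceback ends in the sparse matrix's __mul__, where * means matrix multiplication, so multiplying a 1xN row by another 1xN row raises the dimension mismatch.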
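As an alternative that keeps everything sparse, here is a minimal sketch that computes the similarity directly with the sparse matrix's own multiply() and sum() methods. The function name sparse_cosine_similarity and the toy rows xs, ys, zs are made up for illustration, and the sketch assumes neither row is all zeros (otherwise the division yields nan).

```python
import numpy as np
from scipy import sparse

def sparse_cosine_similarity(row1, row2):
    """Cosine similarity of two 1xN sparse rows without densifying them."""
    # multiply() is the element-wise product; summing it gives the dot product.
    dot = row1.multiply(row2).sum()
    # Euclidean norm: square root of the sum of squared entries.
    norm1 = np.sqrt(row1.multiply(row1).sum())
    norm2 = np.sqrt(row2.multiply(row2).sum())
    return dot / (norm1 * norm2)

# Hypothetical toy rows standing in for tf-idf vectors.
xs = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0, 2.0]]))
ys = sparse.csr_matrix(np.array([[2.0, 0.0, 4.0, 4.0]]))  # parallel to xs
zs = sparse.csr_matrix(np.array([[0.0, 5.0, 0.0, 0.0]]))  # orthogonal to xs

# On sparse matrices, * is a matrix product, so 1x4 times 1x4 is rejected.
try:
    xs * ys
    star_is_matrix_product = False
except ValueError:
    star_is_matrix_product = True

sim_parallel = sparse_cosine_similarity(xs, ys)
sim_orthogonal = sparse_cosine_similarity(xs, zs)
```

Parallel rows should score close to 1 and rows with no shared non-zero columns should score 0.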

If you convert both input vectors (actually here they are row-vectors in matrix form) to dense versions before passing them, it works fine:

>>> xs
<1x4 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>
>>> ys
<1x4 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>
>>> cosine(xs, ys)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/scipy/spatial/distance.py", line 296, in cosine
    dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
  File "/usr/lib/python3.4/site-packages/scipy/sparse/base.py", line 308, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch
>>> cosine(xs.todense(), ys.todense())
-2.2204460492503131e-16

Converting to dense should be fine when done on individual 5369-element vectors, as opposed to densifying the whole matrix at once.
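Applied to the loop from the question, the idea might look like the sketch below. It is only an illustration: the small tfidf matrix and centroid are invented stand-ins for the real 1000x5369 data, cosine_similarity_dense is a hypothetical helper name, and it uses toarray().ravel() to obtain 1-D arrays on the assumption that the installed SciPy expects 1-D inputs to cosine.

```python
import numpy as np
from scipy import sparse
from scipy.spatial.distance import cosine

def cosine_similarity_dense(row1, row2):
    """Cosine similarity of two 1xN sparse rows, densified one row at a time."""
    # toarray() gives a 1xN ndarray; ravel() flattens it to shape (N,).
    u = row1.toarray().ravel()
    v = row2.toarray().ravel()
    return 1.0 - cosine(u, v)

# Invented stand-ins for the tf-idf matrix and one centroid.
tfidf = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0, 0.0],
                                    [0.0, 3.0, 0.0, 1.0],
                                    [2.0, 0.0, 4.0, 0.0]]))
centroid = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0, 0.0]]))

# Index rows explicitly so each one stays a 1xN sparse row until densified.
sim_scores = [cosine_similarity_dense(tfidf[i], centroid)
              for i in range(tfidf.shape[0])]
```

Only one row is dense at any time, so memory use stays proportional to the vocabulary size rather than to the full matrix.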

Upvotes: 4
