Reputation: 13

Filtering a large sparse matrix in python

I have a large (60,000 x 60,000) symmetric document similarity matrix stored in the scipy sparse csr_matrix format.

I want to find the indices of all values that are above a certain value. In other words, all the document pairs that have a similarity score greater than a certain value.

When I try something like

matrix > 0.9

my ipython kernel crashes.

I'm new to scipy and numpy, so any help would be greatly appreciated.

Upvotes: 1

Answers (1)

Erotemic

Reputation: 5238

I would try performing the operation on a smaller set of data first. I just tried:

In [21]: import numpy as np
In [22]: import scipy.sparse as sps
In [23]: m = sps.csr_matrix(np.random.rand(100,100))

In [24]: m
Out[24]: 
<100x100 sparse matrix of type '<type 'numpy.float64'>'
    with 10000 stored elements in Compressed Sparse Row format>

In [25]: m > .5
Out[25]: 
<100x100 sparse matrix of type '<type 'numpy.bool_'>'
    with 5028 stored elements in Compressed Sparse Row format>

So that seemed to work. Maybe your matrix is too big or too dense. Did you build scipy yourself? If so, a build error could be causing the crash.
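If all you need is the index pairs, one option is to skip the boolean matrix and filter the stored values directly. This is a minimal sketch, not a tested fix for your crash. It assumes the threshold is non-negative (so the implicit zeros can never pass) and that the matrix is symmetric, so only the upper triangle is kept by default. `pairs_above` and `upper_only` are names made up for this example.

```python
import numpy as np
import scipy.sparse as sps


def pairs_above(matrix, threshold, upper_only=True):
    """Return (rows, cols, values) for stored entries greater than threshold.

    Assumes threshold >= 0, so entries that are not stored (implicit zeros)
    never qualify and only the stored values need to be checked.
    """
    coo = matrix.tocoo()  # COO exposes parallel row / col / data arrays
    mask = coo.data > threshold
    if upper_only:
        # For a symmetric matrix, (i, j) and (j, i) are the same pair;
        # keeping row < col also drops the diagonal (self-similarity).
        mask &= coo.row < coo.col
    return coo.row[mask], coo.col[mask], coo.data[mask]


# Small random stand-in for the real similarity matrix
demo = sps.random(1000, 1000, density=0.01, format="csr", random_state=0)
rows, cols, vals = pairs_above(demo, 0.9)
print(len(rows), "pairs above the threshold")
```

The extra memory here is proportional to the number of stored elements, not to the full 60,000 x 60,000 shape.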

What is your OS, and which versions of Python and scipy are you running? You can check the scipy version with:

import scipy
scipy.__version__

Upvotes: 1
