Reputation: 13
I have a large (60,000 x 60,000) symmetric document similarity matrix stored in the scipy sparse csr_matrix format.
I want to find the indices of all values that are above a certain threshold. In other words, I want all the document pairs that have a similarity score greater than that threshold.
When I try something like
matrix > 0.9
my ipython kernel crashes.
I'm new to scipy and numpy, so any help would be greatly appreciated.
Upvotes: 1
Views: 2173
Reputation: 5238
I would try performing the operation on a smaller set of data first. I just tried:
In [21]: import numpy as np
In [22]: import scipy.sparse as sps
In [23]: m = sps.csr_matrix(np.random.rand(100,100))
In [24]: m
Out[24]:
<100x100 sparse matrix of type '<type 'numpy.float64'>'
with 10000 stored elements in Compressed Sparse Row format>
In [25]: m > .5
Out[25]:
<100x100 sparse matrix of type '<type 'numpy.bool_'>'
with 5028 stored elements in Compressed Sparse Row format>
So the comparison works on a small matrix. Maybe your matrix is too big or too dense for it to fit in memory. Did you build scipy yourself? If so, a build error could also be causing the crash.
What is your OS, your version of Python, and your version of scipy? You can check the last one with:
import scipy
scipy.__version__
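If the comparison on the full matrix is what brings the kernel down, one thing to try is filtering the stored values directly instead of building a boolean matrix. The following is only a sketch: `pairs_above`, `threshold`, and the tiny 3x3 matrix are names and data I made up for illustration, and it assumes the threshold is non-negative, so the implicit zeros can never pass the test.

```python
import numpy as np
import scipy.sparse as sps

def pairs_above(matrix, threshold):
    """Return (rows, cols, values) for stored entries greater than threshold.

    Hypothetical helper; assumes threshold >= 0 so implicit zeros never match.
    """
    coo = matrix.tocoo()            # COO exposes parallel row/col/data arrays
    keep = coo.data > threshold     # 1-D comparison over stored values only
    return coo.row[keep], coo.col[keep], coo.data[keep]

# Tiny symmetric matrix standing in for the real similarity matrix
dense = np.array([[1.0, 0.95, 0.0],
                  [0.95, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
m = sps.csr_matrix(dense)

rows, cols, vals = pairs_above(m, 0.9)

# The matrix is symmetric, so keep each pair once and skip the diagonal
upper = rows < cols
pairs = list(zip(rows[upper].tolist(), cols[upper].tolist()))
```

Here `pairs` holds the (row, column) index pairs whose similarity exceeds the threshold. The extra memory is proportional to the number of stored entries, so whether this helps depends on how dense your matrix really is.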
Upvotes: 1