Reputation: 6018
I have a 16000 X 600 boolean matrix. I want to create a distance matrix from it using Jaccard distance as the metric. The final matrix would have dimensions 16000 X 16000. For this I used scipy.spatial.distance.pdist.
When I run my program, after some time it throws this warning but does not exit:
/usr/lib/python2.7/dist-packages/scipy/spatial/distance.py:386: RuntimeWarning: invalid value encountered in double_scalars
return (np.double(np.bitwise_and((u != v), np.bitwise_or(u != 0, v != 0)).sum()) / np.double(np.bitwise_or(u != 0, v != 0).sum()))
and there is no further output, even after the program is left to run for some more time.
How can I rectify this issue?
Other details:
Upvotes: 0
Views: 2544
Reputation: 17797
Looks like you have two issues:
Your matrix is large and sparse, so try using a sparse representation:
import scipy.sparse
# assuming your big boolean matrix is called A
sA = scipy.sparse.csr_matrix(A)
Any pair of rows with all zeros (in both rows) will produce NaN values. This is what's triggering the warning.
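To make the second point concrete, here is a minimal sketch that rewrites the expression from the warning as a small function and feeds it two all-zero rows. The name dense_jaccard is my own, not SciPy's, and the 600-element rows just mirror the shape in the question. Both the numerator and the denominator come out as 0, so the division gives NaN.

```python
import numpy as np

def dense_jaccard(u, v):
    """Jaccard distance written like the expression in the warning (hypothetical helper)."""
    u = np.asarray(u)
    v = np.asarray(v)
    either = np.bitwise_or(u != 0, v != 0)    # positions set in u or v
    differ = np.bitwise_and(u != v, either)   # positions set in exactly one of them
    return np.double(differ.sum()) / np.double(either.sum())

# Two rows that are entirely False: 0 / 0, hence NaN.
empty = np.zeros(600, dtype=bool)
with np.errstate(invalid="ignore"):  # silence the RuntimeWarning for this demo
    d = dense_jaccard(empty, empty)
print(np.isnan(d))
```

If empty rows carry no information in your data, dropping them before computing distances avoids the warning altogether.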
The reason your code never exits is that the matrix is simply too big: 16000 rows means roughly 128 million pairwise distances. Unfortunately, SciPy's distance functions don't support sparse matrices, so you'll have to write the Jaccard distance yourself. (See scipy's dense implementation here.)
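One possible way to write it yourself, as a rough sketch rather than a tuned implementation: for boolean rows, the product of the sparse matrix with its transpose gives the intersection sizes, and the row sums give the union sizes via |u| + |v| - |u and v|. The function name sparse_jaccard_distances and the empty_value parameter are my own inventions. Returning 0.0 for a pair of all-zero rows is an assumption; pick whatever makes sense for your data.

```python
import numpy as np
import scipy.sparse

def sparse_jaccard_distances(A, empty_value=0.0):
    """Pairwise Jaccard distances between the rows of a boolean matrix.

    Hypothetical helper, not part of SciPy. empty_value is what a pair of
    all-zero rows gets instead of NaN (an assumption, adjust as needed).
    """
    # Store only the nonzero entries, as integer ones so products count overlaps.
    sA = scipy.sparse.csr_matrix(np.asarray(A) != 0, dtype=np.int32)
    # inter[i, j] = number of columns set in both row i and row j.
    inter = sA.dot(sA.T).toarray().astype(np.float64)
    # Number of set columns per row.
    row_sums = np.asarray(sA.sum(axis=1)).ravel()
    # union[i, j] = |row i| + |row j| - |intersection|.
    union = row_sums[:, None] + row_sums[None, :] - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        dist = 1.0 - inter / union
    dist[union == 0] = empty_value  # pairs where both rows are all zeros
    return dist

# Small usage example: rows 2 and 3 are all zeros.
A = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 0],
              [0, 0, 0]], dtype=bool)
D = sparse_jaccard_distances(A)
print(D)
```

Keep in mind that the result is still a dense square matrix. At 16000 X 16000 in float64 that is about 2 GB, so you may need to compute it in blocks of rows (multiply a slice of sA by sA.T) or use a smaller dtype.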
Upvotes: 2