Reputation: 194
I'm trying to find the dot product between a huge matrix and itself.
Shape of the matrix: (371744, 36154). Number of non-zeros: 577731 [very sparse]
mat1 is a scipy.sparse.csr_matrix. If I use mat1 * mat1.T I get a ValueError; this looks like it's because there are too many non-zero elements in the resulting matrix and the index pointer overflows, according to here:
dp_data = data_m * data_m.T
File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 247, in __mul__
return self._mul_sparse_matrix(other)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 300, in _mul_sparse_matrix
return self.tocsr()._mul_sparse_matrix(other)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 290, in _mul_sparse_matrix
indices = np.empty(nnz, dtype=np.intc)
ValueError: negative dimensions are not allowed
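As a quick check on the overflow explanation: np.intc (used in the last line of the traceback) is a 32-bit C int on most platforms, so the predicted nnz of the product can't go above 2**31 - 1 without wrapping negative. A minimal way to see the limit:
import numpy as np
# np.intc is a C int; scipy's sparse matmul allocates the result's
# index arrays with this type, so a predicted nnz above its max overflows.
print(np.iinfo(np.intc))   # typically min = -2147483648, max = 2147483647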
I also tried np.dot, but the docs say:
"As of NumPy 1.7, np.dot is not aware of sparse matrices, therefore using it will result on unexpected results or errors. The corresponding dense matrix should be obtained first instead"
When I do mat1.toarray() or todense() I get a MemoryError, as the matrix is huge! I have 16 GB of memory. The program seems to work fine for smaller inputs.
data_array = data_m.toarray()
File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 550, in toarray
return self.tocoo(copy=False).toarray()
File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 219, in toarray
B = np.zeros(self.shape, dtype=self.dtype)
MemoryError
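For scale, a dense float64 array of that shape simply cannot fit in RAM; a quick back-of-the-envelope check:
# Memory needed for the dense array that toarray() tries to allocate,
# assuming 8 bytes per element (float64):
print(371744 * 36154 * 8 / 1e9)   # ~107.5 GB, far more than 16 GB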
I'm using NumPy version 1.8.1 and SciPy version 0.9.0.
How else do I do this multiplication?
Upvotes: 2
Views: 14048
Reputation: 600
Call the dot product as a method of the sparse matrix:
dp_data = data_m.dot(data_m.T)
numpy.dot is a general NumPy function that is unaware of your matrix's sparsity, whereas scipy.sparse.csr_matrix.dot is a method tailored to your matrix type and thus uses a sparse algorithm.
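For instance, a minimal sketch with a small random csr_matrix standing in for data_m (the shape and density here are made up for illustration, not the original data):
import scipy.sparse as sp

# Small random CSR matrix standing in for data_m.
data_m = sp.rand(1000, 200, density=0.01, format='csr')

# Sparse-aware product: the result is itself a sparse matrix.
dp_data = data_m.dot(data_m.T)      # shape (1000, 1000)

print(type(dp_data), dp_data.nnz)
Note that for the shapes in the question the result is a 371744 x 371744 sparse matrix, so memory use ultimately depends on how many of those row-pair dot products turn out non-zero.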
Upvotes: 4
Reputation: 23510
First of all, the size of the sparse output matrix and the amount of CPU work required by the calculation depend on the structure of your sparse matrix. If there are a lot of empty rows, things become easier. If, on the other hand, your data is evenly distributed, there is a lot to calculate.
The first thing to realize is that in this specific case (a * a.T) you are essentially calculating the dot product of each row (a 36154-element vector) with every row. Because the result is symmetric, you only need to compute half of it, which cuts the calculation time in two. (That still leaves you with roughly 70 000 000 000 vector dot products: 371744^2 / 2.)
Now, the question is largely whether you are in a hurry or not. If you are in a hurry (performance is important), then there may be ways to make the calculation faster, depending on how the non-zeros are distributed in your data.
A rather straightforward algorithm is as follows:
# a is in row-form (n_row lists of non-zeros in each row)
for r in 0..n_row-1
    if all elements in row r are zero
        skip the row
    for c in r..n_row-1
        d := dot(a[r], a[c])
        if d <> 0
            result[r,c] = d
            result[c,r] = d
The dot product is easy to calculate by finding the intersection of the sets of non-zero elements in a[r] and a[c]. Most of the time the intersection is empty, and calculations are only required when it is not empty.
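A rough Python translation of this idea (the function name pairwise_row_dots, the dok_matrix used to collect the result, and the dict-based intersection are illustrative choices, not prescribed by the algorithm above):
import scipy.sparse as sp

def pairwise_row_dots(a):
    # Computes a * a.T row by row, exploiting symmetry and skipping
    # empty rows.  Still loops over O(n_row**2) row pairs, so it is
    # slow unless most rows are empty.
    a = a.tocsr()
    n_row = a.shape[0]
    result = sp.dok_matrix((n_row, n_row), dtype=a.dtype)

    # Non-zero column indices and values of each row, extracted once.
    rows = [(a.indices[a.indptr[r]:a.indptr[r + 1]],
             a.data[a.indptr[r]:a.indptr[r + 1]]) for r in range(n_row)]

    for r in range(n_row):
        cols_r, vals_r = rows[r]
        if cols_r.size == 0:            # skip empty rows entirely
            continue
        lookup_r = dict(zip(cols_r, vals_r))
        for c in range(r, n_row):
            cols_c, vals_c = rows[c]
            # Dot product over the intersection of non-zero positions.
            d = sum(v * lookup_r[k] for k, v in zip(cols_c, vals_c)
                    if k in lookup_r)
            if d != 0:
                result[r, c] = d
                result[c, r] = d
    return result.tocsr()
The inner loop over c only does real work when the two index sets intersect, which matches the observation above that most pairs contribute nothing.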
Depending on the number of empty rows in your matrix this may be relatively fast. On the other hand, if you have no empty rows or columns, the time is spent computing roughly 70 000 000 000 set intersections. (In that case most of the intersections are between single-element sets, so they are simple comparisons.)
This method requires little memory, but it still takes a lot of time, from hours to days, unless there are a lot of empty rows.
Upvotes: 2