Reputation: 1870
I have two scipy sparse csr matrices with the exact same shape but potentially different data values and nnz value. I now want to get the top 10 elements of one matrix and increase the value on the same indices on the other matrix. My current approach is as follows:
idx = a.data.argpartition(-10)[-10:]
i, j = matrix.nonzero()
i_idx = i[idx]
j_idx = j[idx]
b[i_idx, j_idx] += 1
The reason I have to go this way is that a.data and b.data do not necessarily have the same number of elements and hence the indices would differ.
My question now is whether I can improve this in some way. As far as I know the nonzero procedure is not elegant as I have to allocate two new arrays and I am very tough on memory already. I can get the j_indices via csr_matrix.indices but what about the i_indices? Can I use the indptr in a nice way for that?
Happy for any hints.
Upvotes: 2
Views: 1725
Reputation: 3877
I'm not sure what the "top 10 elements" means. I assume that if you have matrices A and B you want to set B[i, j] += 1 if A[i, j] is within the first 10 nonzero entries of A (in CSR format). I also assume that B[i, j] could be zero, which is the worst case performance wise, since you need to modify the sparsity structure of your matrix.
CSR is not an ideal format to use for changing sparsity structure. This is because every new insertion/deletion is O(nnzs) complexity (assuming the CSR storage is backed by an array - and it usually is).
You could use the DOK format for your second matrix (the one you are modifying), which provides O(1) access to elements.
I haven't benchmarked this but I suppose your option is 10 * O(nnzs) at worst, when you are adding 10 new nonzero values, whereas the DOK version should need O(nnzs) to build the matrix, then O(1) for each insertion and, finally, O(nnzs) to convert it back to CSR (assuming this is needed).
Upvotes: 0