Reputation: 109
I have a large (79 000 x 480 000) sparse csr matrix. I am trying to remove all columns (within a certain range) for which each value < k.
In regular numpy matrices this is simply done by a mask:
m = np.array([[0,2,1,1],
[0,4,2,0],
[0,3,4,0]])
mask = (arr < 2)
idx = mask.all(axis=0)
result = m[:, ~idx]
print result
>>> [[2 1]
[4 2]
[3 4]]
The unary bitwise negation operator ~ and boolean mask functionality are not available for sparse matrices however. What is the best method to:
Some things to note:
Is using the csr matrix format even a plausible choice for this? Do I transpose and make use of .nonzero()? I have a fair amount of working memory (192GB) so time efficiency is preferable to memory efficiency.
Upvotes: 2
Views: 2221
Reputation: 231540
If I do
M = sparse.csr_matrix(m)
M < 2
I get an efficiency warning; all the 0 values of M satisfy the condition,
In [1754]: print(M)
(0, 1) 2
(0, 2) 1
(0, 3) 1
(1, 1) 4
(1, 2) 2
(2, 1) 3
(2, 2) 4
In [1755]: print(M<2)
/usr/lib/python3/dist-packages/scipy/sparse/compressed.py:275: SparseEfficiencyWarning: Comparing a sparse matrix with a scalar greater than zero using < is inefficient, try using >= instead.
warn(bad_scalar_msg, SparseEfficiencyWarning)
(0, 0) True # not in M
(0, 2) True
(0, 3) True
(1, 0) True # not in M
(1, 3) True
(2, 0) True # not in M
(2, 3) True
In [1756]: print(M>=2) # all a subset of M
(0, 1) True
(1, 1) True
(1, 2) True
(2, 1) True
(2, 2) True
If I=M>=2
; there isn't an all
method, but there is a sum
.
In [1760]: I.sum(axis=0)
Out[1760]: matrix([[0, 3, 2, 0]], dtype=int32)
sum
is actually performed using a matrix multiplication
In [1769]: np.ones((1,3),int)*I
Out[1769]: array([[0, 3, 2, 0]], dtype=int32)
Using nonzero
to find the nonzero columns:
In [1778]: np.nonzero(I.sum(axis=0))
Out[1778]: (array([0, 0], dtype=int32), array([1, 2], dtype=int32))
In [1779]: M[:,np.nonzero(I.sum(axis=0))[1]]
Out[1779]:
<3x2 sparse matrix of type '<class 'numpy.int32'>'
with 6 stored elements in Compressed Sparse Row format>
In [1780]: M[:,np.nonzero(I.sum(axis=0))[1]].A
Out[1780]:
array([[2, 1],
[4, 2],
[3, 4]], dtype=int32)
General points:
watch out for those 0 values when doing comparisons
watch out for False values when doing logic on sparse matrices
sparse matrices are optimized for math, especially matrix multiplication
sparse indexing isn't quite as powerful as array indexing; and not as fast either.
note when operations produce a dense matrix
Upvotes: 4