Scipy: sparse matrix conditional removal of columns

Question

I have a large (79 000 x 480 000) sparse csr matrix. I am trying to remove all columns (within a certain range) for which each value < k.

In regular numpy matrices this is simply done by a mask:

m = np.array([[0,2,1,1],
                [0,4,2,0],
                [0,3,4,0]])
mask = (arr < 2)
idx = mask.all(axis=0)
result = m[:, ~idx]
print result
>>> [[2 1]
     [4 2]
     [3 4]]

The unary bitwise negation operator ~ and boolean mask functionality are not available for sparse matrices however. What is the best method to:

Obtain the indices of columns where all values fulfill condition e < k.
Remove these columns based on the list of indices.

Some things to note:

The columns represent ngram text features: there are no columns in the matrix for which each element is zero.

Is using the csr matrix format even a plausible choice for this? Do I transpose and make use of .nonzero()? I have a fair amount of working memory (192GB) so time efficiency is preferable to memory efficiency.

hpaulj · Accepted Answer

If I do

M = sparse.csr_matrix(m)

M < 2

I get an efficiency warning; all the 0 values of M satisfy the condition,

In [1754]: print(M)
  (0, 1)    2
  (0, 2)    1
  (0, 3)    1
  (1, 1)    4
  (1, 2)    2
  (2, 1)    3
  (2, 2)    4
In [1755]: print(M<2)
/usr/lib/python3/dist-packages/scipy/sparse/compressed.py:275: SparseEfficiencyWarning: Comparing a sparse matrix with a scalar greater than zero using < is inefficient, try using >= instead.
  warn(bad_scalar_msg, SparseEfficiencyWarning)
  (0, 0)    True     # not in M
  (0, 2)    True
  (0, 3)    True
  (1, 0)    True    # not in M
  (1, 3)    True
  (2, 0)    True    # not in M
  (2, 3)    True
In [1756]: print(M>=2)   # all a subset of M
  (0, 1)    True
  (1, 1)    True
  (1, 2)    True
  (2, 1)    True
  (2, 2)    True

If I=M>=2; there isn't an all method, but there is a sum.

In [1760]: I.sum(axis=0)
Out[1760]: matrix([[0, 3, 2, 0]], dtype=int32)

sum is actually performed using a matrix multiplication

In [1769]: np.ones((1,3),int)*I
Out[1769]: array([[0, 3, 2, 0]], dtype=int32)

Using nonzero to find the nonzero columns:

In [1778]: np.nonzero(I.sum(axis=0))
Out[1778]: (array([0, 0], dtype=int32), array([1, 2], dtype=int32))
In [1779]: M[:,np.nonzero(I.sum(axis=0))[1]]
Out[1779]: 
<3x2 sparse matrix of type ''
    with 6 stored elements in Compressed Sparse Row format>
In [1780]: M[:,np.nonzero(I.sum(axis=0))[1]].A
Out[1780]: 
array([[2, 1],
       [4, 2],
       [3, 4]], dtype=int32)

General points:

watch out for those 0 values when doing comparisons
watch out for False values when doing logic on sparse matrices
sparse matrices are optimized for math, especially matrix multiplication
sparse indexing isn't quite as powerful as array indexing; and not as fast either.
note when operations produce a dense matrix

Scipy: sparse matrix conditional removal of columns

Answers (1)

Related Questions