GJacobs
GJacobs

Reputation: 109

Scipy: sparse matrix conditional removal of columns

I have a large (79 000 x 480 000) sparse csr matrix. I am trying to remove all columns (within a certain range) for which each value < k.

In regular numpy matrices this is simply done by a mask:

m = np.array([[0,2,1,1],
                [0,4,2,0],
                [0,3,4,0]])
mask = (arr < 2)
idx = mask.all(axis=0)
result = m[:, ~idx]
print result
>>> [[2 1]
     [4 2]
     [3 4]]

The unary bitwise negation operator ~ and boolean mask functionality are not available for sparse matrices however. What is the best method to:

  1. Obtain the indices of columns where all values fulfill condition e < k.
  2. Remove these columns based on the list of indices.

Some things to note:

  1. The columns represent ngram text features: there are no columns in the matrix for which each element is zero.

Is using the csr matrix format even a plausible choice for this? Do I transpose and make use of .nonzero()? I have a fair amount of working memory (192GB) so time efficiency is preferable to memory efficiency.

Upvotes: 2

Views: 2221

Answers (1)

hpaulj
hpaulj

Reputation: 231540

If I do

M = sparse.csr_matrix(m)

M < 2

I get an efficiency warning; all the 0 values of M satisfy the condition,

In [1754]: print(M)
  (0, 1)    2
  (0, 2)    1
  (0, 3)    1
  (1, 1)    4
  (1, 2)    2
  (2, 1)    3
  (2, 2)    4
In [1755]: print(M<2)
/usr/lib/python3/dist-packages/scipy/sparse/compressed.py:275: SparseEfficiencyWarning: Comparing a sparse matrix with a scalar greater than zero using < is inefficient, try using >= instead.
  warn(bad_scalar_msg, SparseEfficiencyWarning)
  (0, 0)    True     # not in M
  (0, 2)    True
  (0, 3)    True
  (1, 0)    True    # not in M
  (1, 3)    True
  (2, 0)    True    # not in M
  (2, 3)    True
In [1756]: print(M>=2)   # all a subset of M
  (0, 1)    True
  (1, 1)    True
  (1, 2)    True
  (2, 1)    True
  (2, 2)    True

If I=M>=2; there isn't an all method, but there is a sum.

In [1760]: I.sum(axis=0)
Out[1760]: matrix([[0, 3, 2, 0]], dtype=int32)

sum is actually performed using a matrix multiplication

In [1769]: np.ones((1,3),int)*I
Out[1769]: array([[0, 3, 2, 0]], dtype=int32)

Using nonzero to find the nonzero columns:

In [1778]: np.nonzero(I.sum(axis=0))
Out[1778]: (array([0, 0], dtype=int32), array([1, 2], dtype=int32))
In [1779]: M[:,np.nonzero(I.sum(axis=0))[1]]
Out[1779]: 
<3x2 sparse matrix of type '<class 'numpy.int32'>'
    with 6 stored elements in Compressed Sparse Row format>
In [1780]: M[:,np.nonzero(I.sum(axis=0))[1]].A
Out[1780]: 
array([[2, 1],
       [4, 2],
       [3, 4]], dtype=int32)

General points:

  • watch out for those 0 values when doing comparisons

  • watch out for False values when doing logic on sparse matrices

  • sparse matrices are optimized for math, especially matrix multiplication

  • sparse indexing isn't quite as powerful as array indexing; and not as fast either.

  • note when operations produce a dense matrix

Upvotes: 4

Related Questions