dfrankow

Reputation: 21387

How to select only some rows of a scipy.sparse csr_matrix?

Here is an example of filtering rows, first from a dense pandas DataFrame, then from a sparse matrix built from it.

import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({'thing': [1, 1, 2, 2, 2],
                   'score': [0.12, 0.13, 0.14, 0.15, 0.17]})

row_index = df['thing'] == 1
print(type(row_index), row_index)
print(df[row_index])
sdf = csr_matrix(df)
print(sdf[row_index])

The second print correctly returns only the first two rows, but the third print raises an error (full output below).

How do I fix this code so it filters the rows of a csr_matrix by row_index without converting it to a dense matrix? In my real application the matrix holds the output of a TF/IDF vectorizer, so it has thousands of columns and densifying it is not an option.

I've found some related questions, but I can't tell if the answer is there or not.

I'm using pandas 0.25.3 and scipy 1.3.2.

Full output of code above:

<class 'pandas.core.series.Series'> 0     True
1     True
2    False
3    False
4    False
Name: thing, dtype: bool
   thing  score
0      1   0.12
1      1   0.13
Traceback (most recent call last):
  File "./foo.py", line 13, in <module>
    print(sdf[row_index])
  File "root/.venv/lib/python3.7/site-packages/scipy/sparse/_index.py", line 59, in __getitem__
    return self._get_arrayXslice(row, col)
  File "root/.venv/lib/python3.7/site-packages/scipy/sparse/csr.py", line 325, in _get_arrayXslice
    return self._major_index_fancy(row)._get_submatrix(minor=col)
  File "root/.venv/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 690, in _major_index_fancy
    np.cumsum(row_nnz[idx], out=res_indptr[1:])
  File "<__array_function__ internals>", line 6, in cumsum
  File "root/.venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2423, in cumsum
    return _wrapfunc(a, 'cumsum', axis=axis, dtype=dtype, out=out)
  File "root/.venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
    return bound(*args, **kwds)
ValueError: provided out is the wrong size for the reduction

EDIT: This depends on the scipy version. I submitted this issue to scipy.

Upvotes: 1

Views: 2384

Answers (1)

hpaulj

Reputation: 231395

In [175]: sdf = sparse.csr_matrix(df)                                           
In [176]: df                                                                    
Out[176]: 
   thing  score
0      1   0.12
1      1   0.13
2      2   0.14
3      2   0.15
4      2   0.17
In [177]: sdf                                                                   
Out[177]: 
<5x2 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>
In [178]: sdf.A                                                                 
Out[178]: 
array([[1.  , 0.12],
       [1.  , 0.13],
       [2.  , 0.14],
       [2.  , 0.15],
       [2.  , 0.17]])

row_index is a pd.Series:

In [179]: row_index                                                             
Out[179]: 
0     True
1     True
2    False
3    False
4    False
Name: thing, dtype: bool

The array equivalent (via .values) works as a boolean index:

In [180]: row_index.values                                                      
Out[180]: array([ True,  True, False, False, False])
In [181]: sdf[_]                                                                
Out[181]: 
<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>
In [182]: _.A                                                                   
Out[182]: 
array([[1.  , 0.12],
       [1.  , 0.13]])
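
So the fix is to convert the Series to a plain NumPy array before indexing. A minimal sketch applied to the original code (.values works here; Series.to_numpy() is equivalent in pandas >= 0.24):

import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({'thing': [1, 1, 2, 2, 2],
                   'score': [0.12, 0.13, 0.14, 0.15, 0.17]})
sdf = csr_matrix(df)

row_index = df['thing'] == 1
# Index with the underlying boolean ndarray, not the Series itself;
# the result stays sparse (a 2x2 csr_matrix here).
filtered = sdf[row_index.to_numpy()]
print(filtered.A)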

The Series does work as an index with the dense array:

In [185]: (sdf.A)[row_index]                                                    
Out[185]: 
array([[1.  , 0.12],
       [1.  , 0.13]])

But a sparse matrix is not a subclass of ndarray. It is similar in many ways, but it uses its own indexing code throughout.
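
If a given scipy version rejects the boolean array as well (the asker's EDIT suggests this is version-dependent), an integer fancy index is a safe fallback; a sketch using np.flatnonzero to turn the mask into row positions:

import numpy as np

# Convert the boolean mask into integer row positions and use
# fancy indexing, which csr_matrix supports along the rows.
rows = np.flatnonzero(row_index.to_numpy())  # array([0, 1])
filtered = sdf[rows]                         # still a sparse csr_matrix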

Upvotes: 1
