Reputation: 21387
Here is an example filtering rows from a Pandas dataframe, first dense, then sparse.
import pandas as pd
from scipy.sparse import csr_matrix
df = pd.DataFrame({'thing': [1, 1, 2, 2, 2],
                   'score': [0.12, 0.13, 0.14, 0.15, 0.17]})
row_index = df['thing'] == 1
print(type(row_index), row_index)
print(df[row_index])
sdf = csr_matrix(df)
print(sdf[row_index])
The second print outputs only the first two rows. The third print raises an error (see the full output below).
How do I fix this code to properly filter the rows of a csr_matrix by row_index, without making it a dense matrix? In my real example, I have results of a TF/IDF vectorizer, so it has thousands of columns and I don't want to make that dense.
I've found some related questions, but I can't tell if the answer is there or not.
I'm using pandas 0.25.3 and scipy 1.3.2.
Full output of code above:
<class 'pandas.core.series.Series'> 0 True
1 True
2 False
3 False
4 False
Name: thing, dtype: bool
thing score
0 1 0.12
1 1 0.13
Traceback (most recent call last):
File "./foo.py", line 13, in <module>
print(sdf[row_index])
File "root/.venv/lib/python3.7/site-packages/scipy/sparse/_index.py", line 59, in __getitem__
return self._get_arrayXslice(row, col)
File "root/.venv/lib/python3.7/site-packages/scipy/sparse/csr.py", line 325, in _get_arrayXslice
return self._major_index_fancy(row)._get_submatrix(minor=col)
File "root/.venv/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 690, in _major_index_fancy
np.cumsum(row_nnz[idx], out=res_indptr[1:])
File "<__array_function__ internals>", line 6, in cumsum
File "root/.venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2423, in cumsum
return _wrapfunc(a, 'cumsum', axis=axis, dtype=dtype, out=out)
File "root/.venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
return bound(*args, **kwds)
ValueError: provided out is the wrong size for the reduction
EDIT: This depends on scipy version. I submitted this issue to scipy.
Upvotes: 1
Views: 2384
Reputation: 231395
In [175]: sdf = sparse.csr_matrix(df)
In [176]: df
Out[176]:
thing score
0 1 0.12
1 1 0.13
2 2 0.14
3 2 0.15
4 2 0.17
In [177]: sdf
Out[177]:
<5x2 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
In [178]: sdf.A
Out[178]:
array([[1. , 0.12],
[1. , 0.13],
[2. , 0.14],
[2. , 0.15],
[2. , 0.17]])
row_index is a pd.Series:
In [179]: row_index
Out[179]:
0 True
1 True
2 False
3 False
4 False
Name: thing, dtype: bool
The array equivalent, row_index.values, works as a boolean index:
In [180]: row_index.values
Out[180]: array([ True, True, False, False, False])
In [181]: sdf[_]
Out[181]:
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [182]: _.A
Out[182]:
array([[1. , 0.12],
[1. , 0.13]])
The Series does work as an index on the dense array:
In [185]: (sdf.A)[row_index]
Out[185]:
array([[1. , 0.12],
[1. , 0.13]])
But a sparse matrix is not a subclass of ndarray. It is similar in many ways, but uses its own code throughout, including its own indexing code.
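Putting that together, here is a minimal sketch of the filter from the question that stays sparse throughout. It assumes the approach shown above (convert the boolean Series to a plain NumPy array before indexing). The names mask, filtered, rows and filtered_by_position are only illustrative, and the integer-position variant is an alternative I am adding, not something from the question.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({'thing': [1, 1, 2, 2, 2],
                   'score': [0.12, 0.13, 0.14, 0.15, 0.17]})

# Build the sparse matrix from the underlying array of the frame.
sdf = csr_matrix(df.values)

# Convert the boolean Series to a plain NumPy boolean array;
# this is what the sparse indexing code is given in place of the Series.
mask = (df['thing'] == 1).values

# Boolean row selection; the result is still a sparse matrix.
filtered = sdf[mask]
print(filtered.shape)  # (2, 2)

# Alternative (an assumption on my part, not from the question):
# select the same rows by integer position instead of by boolean mask.
rows = np.flatnonzero(mask)
filtered_by_position = sdf[rows]
```

Neither variant calls .A or .toarray() on the full matrix, so the thousands of TF/IDF columns are never made dense; only the selected rows are copied into a new sparse matrix.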
Upvotes: 1