Reputation: 6869
I have many lists that represent a sparse matrix (i.e., the columns that have nonzero entries) that I need to represent as a SciPy sparse csc_matrix
. However, note that there is only one row in my sparse matrix and so the list simply points to the columns within this row that has nonzero entries. For example:
sparse_input = [4, 10, 21] # My lists are much, much longer but very sparse
This list tells me which columns within my single row sparse matrix where there are nonzero values. This is what the dense matrix would look like.
x = np.array([[0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1]])
I could use the (data, (row, col))
syntax but since my lists are super long the csc_matrix
takes a lot of time and memory to build. So, I was thinking about using the indptr
interface but I'm having trouble figuring out how to quickly and automatically build the indptr
directly from a given sparse list of nonzero column entries. I tried looking at csr_matrix(x).indptr
and I see that the indptr
looks like:
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3], dtype=int32)
I've read the SciPy docs and the Sparse Matrix Wikipedia page but I can't seem to come up with an efficient method to construct the indptr
directly from a list of nonzero columns. It just feels like indptr
shouldn't be this long in length considering that there are only three nonzero entries in the sparse matrix.
Upvotes: 2
Views: 406
Reputation: 231325
How about just making the matrices, and exploring their attributs?
In [144]: from scipy import sparse
In [145]: x = np.array([[0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1]])
In [146]: M = sparse.coo_matrix(x)
In [147]: M
Out[147]:
<1x22 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
In [148]: M.row
Out[148]: array([0, 0, 0], dtype=int32)
In [149]: M.col
Out[149]: array([ 4, 10, 21], dtype=int32)
In [150]: M.data
Out[150]: array([1, 1, 1])
csr:
In [152]: Mr = M.tocsr()
In [153]: Mr.indptr
Out[153]: array([0, 3], dtype=int32)
In [155]: Mr.indices
Out[155]: array([ 4, 10, 21], dtype=int32)
In [156]: Mr.data
Out[156]: array([1, 1, 1], dtype=int64)
csc:
In [157]: Mc = M.tocsc()
In [158]: Mc.indptr
Out[158]:
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3], dtype=int32)
In [159]: Mc.indices
Out[159]: array([0, 0, 0], dtype=int32)
In [160]: Mc.data
Out[160]: array([1, 1, 1], dtype=int64)
And the direct nonzero
on x
:
In [161]: np.nonzero(x)
Out[161]: (array([0, 0, 0]), array([ 4, 10, 21]))
For a 1 row matrix like this, I doubt if you'll save much time by creating the csr
indptr
directly. Most of the work will be in the nonzero
step. But feel free to experiement.
===
Some timings
In [162]: timeit sparse.coo_matrix(x)
95.8 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [163]: timeit sparse.csr_matrix(x)
335 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [164]: timeit M.tocsr()
115 µs ± 948 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [165]: timeit M.tocsc()
117 µs ± 90.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [166]: sparse.csr_matrix?
In [167]: timeit M.tocsc()
117 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [168]: timeit sparse.csc_matrix(x)
335 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [169]: timeit sparse.coo_matrix(x).tocsr()
219 µs ± 3.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I'm a little surprised that the csr_matrix
is slower than coo
followed by conversion.
Now let's try to make the matrix with indptr
etc.
In [170]: timeit np.nonzero(x)
2.52 µs ± 65.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [173]: timeit sparse.csr_matrix((Mr.data, Mr.indices, Mr.indptr))
92.5 µs ± 79.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [174]: %%timeit
...: indices = np.nonzero(x)[1]
...: data = np.ones_like(indices)
...: indptr = np.array([0,len(indices)])
...: sparse.csr_matrix((data, indices, indptr))
...:
...:
161 µs ± 605 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Upvotes: 0