Reputation: 6191
Yes, I've read this and this answer, but I cannot still grasp my mind around it... it's a basic question.
In:
M[:, index]
M[index, :]
Which one is row slicing, and which one is column slicing?
And to my problem, if I want to do advanced indexing for the columns like:
M[:, indexes] # indexes is an array like [0, 4, 9]
Which sparse matrix type is most efficient for doing M[:, indexes]
, CSR or CSC ?
Upvotes: 8
Views: 4378
Reputation: 231385
While row selection is faster for csr
than for col selection, the difference is not big:
In [288]: Mbig=sparse.rand(1000,1000,.1, 'csr')
In [289]: Mbig[:1000:50,:]
Out[289]:
<20x1000 sparse matrix of type '<class 'numpy.float64'>'
with 2066 stored elements in Compressed Sparse Row format>
In [290]: timeit Mbig[:1000:50,:]
1000 loops, best of 3: 1.53 ms per loop
In [291]: timeit Mbig[:,:1000:50]
100 loops, best of 3: 2.04 ms per loop
In [292]: Mbig=sparse.rand(1000,1000,.1, 'csc')
In [293]: timeit Mbig[:1000:50,:]
100 loops, best of 3: 2.16 ms per loop
In [294]: timeit Mbig[:,:1000:50]
1000 loops, best of 3: 1.65 ms per loop
It isn't worth it to switch format
In [295]: timeit Mbig.tocsr()[:1000:50,:]
...
100 loops, best of 3: 2.46 ms per loop
Contrast this with the same slice for the dense version:
In [297]: A=Mbig.A
In [298]: timeit A[:,:1000:50]
...
1000000 loops, best of 3: 557 ns per loop
In [301]: timeit A[:,:1000:50].copy()
...
10000 loops, best of 3: 52.5 µs per loop
To complicate the comparison, indexing with an array (numpy
advanced) is actually faster than with the 'slice':
In [308]: idx=np.r_[0:1000:50] # expand slice into array
In [309]: timeit Mbig[idx,:]
1000 loops, best of 3: 1.49 ms per loop
In [310]: timeit Mbig[:,idx]
1000 loops, best of 3: 513 µs per loop
Here the column indexing of a csc
has a greater speed improvement.
And single row or column, csr
and csc
have getrow/col
methods:
In [314]: timeit Mbig.getrow(500)
1000 loops, best of 3: 434 µs per loop
In [315]: timeit Mbig.getcol(500) # 1 column from csc is fastest
10000 loops, best of 3: 78.7 µs per loop
In [316]: timeit Mbig[500,:]
1000 loops, best of 3: 505 µs per loop
In [317]: timeit Mbig[:,500]
1000 loops, best of 3: 264 µs per loop
In https://stackoverflow.com/a/39500986/901925 I recreated the extractor
code that sparse
uses to fetch rows or columns. It constructs a new sparse 'vector' of 1s and 0s, and uses matrix multiplication to 'select' rows or columns.
Upvotes: 3
Reputation: 40159
Actually neither is row/column slicing: those are examples of row/column indexing instead.
M[index, :]
is row indexingM[:, index]
is column indexingM[start:stop, :]
is row slicingM[:, start:stop]
is column slicingCSC is more efficient at retrieving entire columns: the non-zero values of a specific column and the matching row indices are internally stored as contiguous arrays in memory.
The dual is true for CSR and the retrieval of an entire row.
Upvotes: 7
Reputation: 696
It's ok to get the order mixed up every once in a while, my trick is to just picture a matrix and remember that the index order counts from up to down and then from left to right: https://en.wikipedia.org/wiki/Index_notation
So, since :
means all, you know that [:, i]
means all rows, and [i, :]
all columns.
To the second part of your question: You want M[:, indices]
, so the trick is in the name: If you loop over your columns (which you do since you specify column indices for all rows), then you want compressed sparse colum format. It says so in the docs you linked:
Advantages of the CSC format
- ...
- efficient column slicing
Upvotes: 2