juanmirocks
juanmirocks

Reputation: 6191

What is row slicing vs What is column slicing?

Yes, I've read this and this answer, but I cannot still grasp my mind around it... it's a basic question.

In:

M[:, index]
M[index, :]

Which one is row slicing, and which one is column slicing?

And to my problem, if I want to do advanced indexing for the columns like:

M[:, indexes]  # indexes is an array like [0, 4, 9]

Which sparse matrix type is most efficient for doing M[:, indexes], CSR or CSC ?

Upvotes: 8

Views: 4378

Answers (3)

hpaulj
hpaulj

Reputation: 231385

While row selection is faster for csr than for col selection, the difference is not big:

In [288]: Mbig=sparse.rand(1000,1000,.1, 'csr')
In [289]: Mbig[:1000:50,:]
Out[289]: 
<20x1000 sparse matrix of type '<class 'numpy.float64'>'
    with 2066 stored elements in Compressed Sparse Row format>
In [290]: timeit Mbig[:1000:50,:]
1000 loops, best of 3: 1.53 ms per loop
In [291]: timeit Mbig[:,:1000:50]
100 loops, best of 3: 2.04 ms per loop

In [292]: Mbig=sparse.rand(1000,1000,.1, 'csc')
In [293]: timeit Mbig[:1000:50,:]
100 loops, best of 3: 2.16 ms per loop
In [294]: timeit Mbig[:,:1000:50]
1000 loops, best of 3: 1.65 ms per loop

It isn't worth it to switch format

In [295]: timeit Mbig.tocsr()[:1000:50,:]
...
100 loops, best of 3: 2.46 ms per loop

Contrast this with the same slice for the dense version:

In [297]: A=Mbig.A
In [298]: timeit A[:,:1000:50]
...
1000000 loops, best of 3: 557 ns per loop
In [301]: timeit A[:,:1000:50].copy()
...
10000 loops, best of 3: 52.5 µs per loop

To complicate the comparison, indexing with an array (numpy advanced) is actually faster than with the 'slice':

In [308]: idx=np.r_[0:1000:50]    # expand slice into array
In [309]: timeit Mbig[idx,:]
1000 loops, best of 3: 1.49 ms per loop
In [310]: timeit Mbig[:,idx]
1000 loops, best of 3: 513 µs per loop

Here the column indexing of a csc has a greater speed improvement.

And single row or column, csr and csc have getrow/col methods:

In [314]: timeit Mbig.getrow(500)
1000 loops, best of 3: 434 µs per loop
In [315]: timeit Mbig.getcol(500)        # 1 column from csc is fastest
10000 loops, best of 3: 78.7 µs per loop
In [316]: timeit Mbig[500,:]
1000 loops, best of 3: 505 µs per loop
In [317]: timeit Mbig[:,500]
1000 loops, best of 3: 264 µs per loop

In https://stackoverflow.com/a/39500986/901925 I recreated the extractor code that sparse uses to fetch rows or columns. It constructs a new sparse 'vector' of 1s and 0s, and uses matrix multiplication to 'select' rows or columns.

Upvotes: 3

ogrisel
ogrisel

Reputation: 40159

Actually neither is row/column slicing: those are examples of row/column indexing instead.

  • M[index, :] is row indexing
  • M[:, index] is column indexing
  • M[start:stop, :] is row slicing
  • M[:, start:stop] is column slicing

CSC is more efficient at retrieving entire columns: the non-zero values of a specific column and the matching row indices are internally stored as contiguous arrays in memory.

The dual is true for CSR and the retrieval of an entire row.

Upvotes: 7

instant
instant

Reputation: 696

It's ok to get the order mixed up every once in a while, my trick is to just picture a matrix and remember that the index order counts from up to down and then from left to right: https://en.wikipedia.org/wiki/Index_notation

down, left

So, since : means all, you know that [:, i] means all rows, and [i, :] all columns.

To the second part of your question: You want M[:, indices], so the trick is in the name: If you loop over your columns (which you do since you specify column indices for all rows), then you want compressed sparse colum format. It says so in the docs you linked:

Advantages of the CSC format

  • ...
  • efficient column slicing

Upvotes: 2

Related Questions