StatsSorceress

Reputation: 3109

Why does indptr not match the values in this csr matrix?

I'm having a hard time understanding why this behaviour is happening.

I have a scipy sparse CSR matrix. Its first ten rows are:

print(my_mat[0:10,])

  (0, 31)       1
  (0, 33)       1
  (1, 36)       1
  (1, 40)       1
  (2, 47)       1
  (2, 48)       1
  (3, 50)       1
  (3, 53)       1
  (4, 58)       1
  (4, 60)       1
  (5, 66)       1
  (5, 68)       1
  (6, 73)       1
  (6, 75)       1
  (7, 77)       1
  (7, 82)       1
  (8, 30)       1
  (8, 32)       1
  (9, 37)       1
  (9, 40)       1

When I call indptr, I get:

m1 = my_mat[0:10,]
print(m1.indptr)
[ 0  2  4  6  8 10 12 14 16 18 20]

Why don't the values of indptr equal 0 0 1 1 2 2 3 3, etc.? Those are the row indices shown in the first column of the printed output above, which is what the accepted answer to this question seems to imply. How can I access those values?

Upvotes: 0

Views: 508

Answers (1)

Warren Weckesser

Reputation: 114966

For a CSR matrix, m1.indptr does not hold the row indices. Instead, for row r, the pair of values start, end = m1.indptr[r:r+2] gives the start and end indices into m1.data of the values that are stored in row r. That is, m1.data[start:end] holds the nonzero values in row r. The columns of these values are in m1.indices[start:end].

In your example, you have m1.indptr = [ 0 2 4 6 8 10 12 14 16 18 20]. So the nonzero values in the first row (row 0) are stored in m1.data[0:2], and the columns where these values are located are stored in m1.indices[0:2]. The nonzero values in the second row (row 1) are m1.data[2:4], and their columns are m1.indices[2:4], etc.
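To make the indptr/indices/data relationship concrete, here is a minimal sketch that walks a CSR matrix row by row. The 3x5 matrix is a made-up example (not the asker's data), and the names dense, m and rows_seen are only for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x5 example matrix; row 1 is deliberately empty.
dense = np.array([[0, 1, 0, 1, 0],
                  [0, 0, 0, 0, 0],
                  [1, 0, 0, 0, 1]])
m = csr_matrix(dense)

print(m.indptr)   # one entry per row, plus a final end marker

rows_seen = []
for r in range(m.shape[0]):
    # indptr[r] and indptr[r+1] delimit row r's slice of indices/data.
    start, end = m.indptr[r:r + 2]
    cols = m.indices[start:end]   # column indices of the values stored in row r
    vals = m.data[start:end]      # the stored values themselves
    rows_seen.append((r, cols.tolist(), vals.tolist()))
    print(r, cols, vals)
```

An empty row shows up as two equal consecutive entries in indptr, so its slice is empty. This is also why indptr always has one more entry than the matrix has rows.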

If you want the row and column indices, probably the simplest method is to use the nonzero() method. For example, here is a CSR matrix:

In [50]: s
Out[50]: 
<5x8 sparse matrix of type '<class 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>

In [51]: s.A
Out[51]: 
array([[ 0, 10, 40,  0,  0, 20,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0, 30,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0]], dtype=int64)

Here we use the nonzero() method to get the row and column indices of the nonzero values:

In [71]: row, col = s.nonzero()

In [72]: row
Out[72]: array([0, 0, 0, 2], dtype=int32)

In [73]: col
Out[73]: array([1, 2, 5, 3], dtype=int32)

Alternatively, you could convert the matrix to the "COO" (coordinate) format. Then you can access the row and col attributes:

In [52]: c = s.tocoo()

In [53]: c.row
Out[53]: array([0, 0, 0, 2], dtype=int32)

In [54]: c.col
Out[54]: array([1, 2, 5, 3], dtype=int32)
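If you would rather stay in CSR format, the row indices can also be rebuilt from indptr directly, since consecutive differences of indptr give the number of stored values in each row. The following is a sketch under that assumption; it rebuilds the 5x8 example matrix from the dense array shown above, and the names dense and counts are only for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Rebuild the 5x8 example matrix from its dense form.
dense = np.zeros((5, 8), dtype=np.int64)
dense[0, 1] = 10
dense[0, 2] = 40
dense[0, 5] = 20
dense[2, 3] = 30
s = csr_matrix(dense)

# Number of stored values in each row.
counts = np.diff(s.indptr)

# Repeat each row number once per stored value in that row.
row = np.repeat(np.arange(s.shape[0]), counts)
col = s.indices

print(row)
print(col)
```

One difference from nonzero(): this lists every stored entry, including any explicitly stored zeros, whereas nonzero() reports only entries whose value is actually nonzero. For a matrix with no stored zeros, as here, the two agree.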

Upvotes: 2
