I'm having a hard time understanding why this behaviour is happening.
I have a SciPy sparse CSR matrix. The first ten rows are:
print my_mat[0:10,]
(0, 31) 1
(0, 33) 1
(1, 36) 1
(1, 40) 1
(2, 47) 1
(2, 48) 1
(3, 50) 1
(3, 53) 1
(4, 58) 1
(4, 60) 1
(5, 66) 1
(5, 68) 1
(6, 73) 1
(6, 75) 1
(7, 77) 1
(7, 82) 1
(8, 30) 1
(8, 32) 1
(9, 37) 1
(9, 40) 1
When I access indptr, I get:
m1 = my_mat[0:10,]
print m1.indptr
[ 0 2 4 6 8 10 12 14 16 18 20]
Why don't the values of indptr equal 0 0 1 1 2 2 3 3, etc. (the row indices shown in the first column of the my_mat output above, which is what the accepted answer to this question implies)? How can I access those values?
Reputation: 114966
For a CSR matrix, m1.indptr does not hold the row indices. Instead, for row r, the pair of values start, end = m1.indptr[r:r+2] gives the start and end indices into m1.data of the values that are stored in row r. That is, m1.data[start:end] holds the nonzero values in row r. The columns of these values are in m1.indices[start:end].
In your example, you have m1.indptr = [ 0 2 4 6 8 10 12 14 16 18 20]. So the nonzero values in the first row are stored in m1.data[0:2], and the columns where these values are located are stored in m1.indices[0:2]. The nonzero values stored in the second row are m1.data[2:4], and their columns are m1.indices[2:4], etc.
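To make the indptr layout concrete, here is a minimal sketch on a small made-up matrix (the matrix m and the variable names are illustrative assumptions, not taken from the question). It walks the rows using indptr, then rebuilds the per-element row indices from indptr alone by repeating each row number once per stored value:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x4 example; row 1 is deliberately empty.
m = csr_matrix(np.array([[0, 1, 0, 1],
                         [0, 0, 0, 0],
                         [1, 0, 1, 0]]))

# indptr has one more entry than there are rows.
print(m.indptr)

# Slice indices and data row by row using consecutive indptr entries.
for r in range(m.shape[0]):
    start, end = m.indptr[r:r + 2]
    print(r, m.indices[start:end], m.data[start:end])

# Number of stored values per row is the difference of consecutive indptr entries.
row_counts = np.diff(m.indptr)

# Repeat each row number once per stored value to get explicit row indices.
rows = np.repeat(np.arange(m.shape[0]), row_counts)
print(rows)
```

An empty row shows up as two equal consecutive entries in indptr, so it contributes nothing to rows.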
If you want the row and column indices, probably the simplest method is to use the nonzero() method. For example, here is a CSR matrix:
In [50]: s
Out[50]:
<5x8 sparse matrix of type '<class 'numpy.int64'>'
with 4 stored elements in Compressed Sparse Row format>
In [51]: s.A
Out[51]:
array([[ 0, 10, 40, 0, 0, 20, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 30, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
Here we use the nonzero() method to get the row and column indices of the nonzero values:
In [71]: row, col = s.nonzero()
In [72]: row
Out[72]: array([0, 0, 0, 2], dtype=int32)
In [73]: col
Out[73]: array([1, 2, 5, 3], dtype=int32)
Alternatively, you could convert the matrix to the "COO" (coordinate) format. Then you can access the row and col attributes:
In [52]: c = s.tocoo()
In [53]: c.row
Out[53]: array([0, 0, 0, 2], dtype=int32)
In [54]: c.col
Out[54]: array([1, 2, 5, 3], dtype=int32)
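The session above does not show how s was created. The following self-contained sketch assumes one way to build a matrix with the same values (the data, row_idx and col_idx arrays are my own construction) and then applies both approaches so they can be compared side by side:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Assumed construction: values and coordinates chosen to match the dense output above.
data = np.array([10, 40, 20, 30])
row_idx = np.array([0, 0, 0, 2])
col_idx = np.array([1, 2, 5, 3])
s = csr_matrix((data, (row_idx, col_idx)), shape=(5, 8))

# Approach 1: nonzero() returns the row and column indices directly.
row, col = s.nonzero()
print(row, col)

# Approach 2: convert to COO and read the row and col attributes.
c = s.tocoo()
print(c.row, c.col)

# For comparison, indptr only marks where each row starts and ends.
print(s.indptr)
```

Both approaches should report the same coordinates for this matrix, while indptr has one entry per row plus one.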