Reputation: 203
I have a 2D sparse matrix, unknown_tfidf, of shape (1000, 10000), whose type is:
<class 'scipy.sparse.csr.csr_matrix'>
I need to get the y index (column index) of every element of this matrix whose value is 1.
I am trying the following (I am not sure whether it is optimal, or even correct), but I get an error:
y=[row.index(1.0) for index, row in enumerate(unknown_tfidf) if int(1.0) in row]
The error is:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
My question: how can I get all the y-indices of such a matrix where the value is 1?
Upvotes: 0
Views: 1792
Reputation: 231385
Your list comprehension works for a nested list:
In [100]: xl=[[0,1,3],[0,0,1],[1,1,0]]
In [101]: [row.index(1) for index, row in enumerate(xl) if 1 in row]
Out[101]: [1, 2, 0]
(Note that index returns just the first match in the third row.)
It does not work for a numpy.array:
In [102]: xa=np.array(xl)
In [103]: [row.index(1) for index, row in enumerate(xa) if 1 in row]
...
AttributeError: 'numpy.ndarray' object has no attribute 'index'
Nor does it work for a sparse matrix:
In [104]: xs=sparse.csr_matrix(xl)
In [105]: xs
Out[105]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 5 stored elements in Compressed Sparse Row format>
In [106]: [row.index(1) for index, row in enumerate(xs) if 1 in row]
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
If I remove the if test, I get a different error, a variation on the dense array error:
In [108]: [row.index(1) for index, row in enumerate(xs)]
...
AttributeError: index not found
Look at what enumerate gives us to work with:
In [109]: [(index,row) for index, row in enumerate(xs)]
Out[109]:
[(0, <1x3 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in Compressed Sparse Row format>),
(1, <1x3 sparse matrix of type '<class 'numpy.int32'>'
with 1 stored elements in Compressed Sparse Row format>),
(2, <1x3 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in Compressed Sparse Row format>)]
Each row is another sparse matrix, the same as xs[0], xs[1], etc. So the 1 in row and row.index(1) expressions have to work with an array or a matrix, or else you get an error.
We've already seen that neither has an index method. That is a list method; you have to use something else for arrays or sparse matrices. Your comprehension has the if clause because the list index method raises an error if the item is not found. In that sense the if ... in test and index go together.
in works for an array, but gives a ValueError for the sparse matrix:
In [114]: 1 in xa[0]
Out[114]: True
In [115]: 1 in xs[0]
....
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
More commonly this ValueError is produced by the equivalent of:
In [117]: if np.array([True, False, True]):'yes'
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
that is, by giving an if a boolean array. In your case this failure occurs within the sparse code. In effect, in has not been implemented for sparse matrices.
So if you insist on using this list comprehension approach, you'll have to turn your sparse matrix into a list of lists:
In [120]: [row.index(1) for index, row in enumerate(xs.toarray().tolist()) if 1 in row]
Out[120]: [1, 2, 0]
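Converting to a list of lists densifies the whole matrix, which may be costly at a shape like (1000, 10000). As a rough sketch of an alternative that keeps the same "first match per row" behaviour, the stored values and column indices of each CSR row can be inspected directly. The variable names (first_cols, hits) are illustrative only, and the approach assumes the value being searched for is nonzero, since zeros are normally not stored:

```python
import numpy as np
from scipy import sparse

xs = sparse.csr_matrix([[0, 1, 3], [0, 0, 1], [1, 1, 0]])

first_cols = []
for i in range(xs.shape[0]):
    row = xs[i]                          # a 1xN CSR matrix
    # row.data holds the stored values, row.indices their column positions
    hits = row.indices[row.data == 1]    # columns whose stored value is 1
    if hits.size:                        # plays the role of `if 1 in row`
        # min() gives the leftmost match even if indices are unsorted
        first_cols.append(int(hits.min()))

print(first_cols)
```

Rows without any 1 are skipped, just as the if clause skips them in the list version.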
Here's a variation on unutbu's answer:
Use a matrix/array equality test to find ALL the elements that match:
In [121]: xs==1
Out[121]:
<3x3 sparse matrix of type '<class 'numpy.bool_'>'
with 4 stored elements in Compressed Sparse Row format>
In [122]: (xs==1).A
Out[122]:
array([[False, True, False],
[False, False, True],
[ True, True, False]], dtype=bool)
Then use a builtin method to get the indices of those True elements:
In [123]: (xs==1).nonzero()
Out[123]: (array([0, 1, 2, 2], dtype=int32), array([1, 2, 0, 1], dtype=int32))
The second element of that tuple is the list you want (with 2 values for the 3rd row).
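For convenience, the same steps as a standalone script (a minimal sketch using the small example matrix from above; the names rows, cols and pairs are just for illustration):

```python
import numpy as np
from scipy import sparse

xs = sparse.csr_matrix([[0, 1, 3], [0, 0, 1], [1, 1, 0]])

# (xs == 1) is a sparse boolean matrix; nonzero() returns the
# row and column indices of its True entries as two arrays.
rows, cols = (xs == 1).nonzero()

# Pair them up as (row, column) coordinates if that is easier to read.
pairs = list(zip(rows.tolist(), cols.tolist()))
print(cols)
print(pairs)
```

Here cols is the array of y-indices asked for, and pairs shows which row each one belongs to.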
Or, to collect values per row (remember, when iterating, each row is a matrix):
In [125]: [i.nonzero() for i in (xs==1)]
Out[125]:
[(array([0], dtype=int32), array([1], dtype=int32)),
(array([0], dtype=int32), array([2], dtype=int32)),
(array([0, 0], dtype=int32), array([0, 1], dtype=int32))]
Reducing that to a simple list of indices per row takes more fiddling:
In [131]: [i.nonzero()[1].tolist() for i in (xs==1)]
Out[131]: [[1], [2], [0, 1]]
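If iterating over rows turns out to be slow on a larger matrix, one possible alternative is to split the .indices array of the boolean CSR matrix at the row boundaries recorded in .indptr. This is only a sketch; it assumes the comparison result is (or can be converted to) CSR format, and mask and per_row are illustrative names:

```python
import numpy as np
from scipy import sparse

xs = sparse.csr_matrix([[0, 1, 3], [0, 0, 1], [1, 1, 0]])

mask = (xs == 1).tocsr()   # sparse boolean matrix in CSR format
mask.sort_indices()        # make column indices ascending within each row

# indptr[i]:indptr[i+1] delimits row i inside mask.indices, so splitting
# at the interior boundaries yields one array of columns per row.
per_row = [chunk.tolist() for chunk in np.split(mask.indices, mask.indptr[1:-1])]
print(per_row)
```

A row with no matches would come out as an empty list.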
Upvotes: 1
Reputation: 879859
Compare the Compressed Sparse Row (CSR) matrix with 1; the column indices of the matching elements are stored in the .indices attribute of the resulting boolean matrix:
import numpy as np
import scipy.sparse as sparse
np.random.seed(2016)
arr = np.round(10*sparse.rand(10, 10, density=0.8, format='csr'))
# arr.A
# array([[ 5., 0., 7., 7., 8., 7., 0., 2., 4., 2.],
# [ 4., 0., 9., 2., 4., 8., 4., 2., 5., 9.],
# [ 7., 4., 4., 2., 4., 0., 0., 0., 6., 0.],
# [ 8., 0., 0., 7., 0., 6., 5., 8., 0., 3.],
# [ 3., 5., 1., 0., 0., 7., 3., 8., 3., 0.],
# [ 8., 6., 7., 0., 8., 2., 7., 0., 1., 1.],
# [ 4., 6., 3., 1., 8., 7., 8., 6., 0., 2.],
# [ 7., 7., 0., 10., 6., 2., 4., 2., 1., 10.],
# [ 10., 0., 4., 8., 1., 1., 3., 1., 9., 1.],
# [ 0., 4., 0., 0., 7., 2., 10., 1., 9., 0.]])
condition = (arr == 1)
print(condition.indices)
yields
[2 8 9 3 8 4 5 7 9 7]
The fastest way to find both the row and column indices where arr equals 1 is to convert the boolean matrix condition to a COO matrix, then read off its row and col attributes:
coo = condition.tocoo()
print(coo.row)
print(coo.col)
yields
[4 5 5 6 7 8 8 8 8 9]
[2 8 9 3 8 4 5 7 9 7]
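One way to convince yourself that the COO attributes give the right answer is to compare them with np.nonzero on a dense copy. The following is a sketch on a small random integer matrix; the seed, shape and value range are arbitrary choices made for the illustration and are not taken from the question:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.integers(0, 3, size=(6, 8))   # small matrix with values 0, 1, 2
arr = sparse.csr_matrix(dense)

coo = (arr == 1).tocoo()                  # sparse boolean matrix in COO format

# Dense reference: row-major coordinates of the elements equal to 1
expected_rows, expected_cols = np.nonzero(dense == 1)

print(np.array_equal(coo.row, expected_rows))
print(np.array_equal(coo.col, expected_cols))
```

The dense copy is only there for the check; on the real (1000, 10000) matrix you would use coo.row and coo.col directly.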
Upvotes: 2