Far

Reputation: 203

Finding the y index of a sparse 2D matrix by its value in Python

I have a 2D sparse matrix "unknown_tfidf" of shape (1000, 10000), whose type is:

<class 'scipy.sparse.csr.csr_matrix'>

I need to get the y index of this matrix wherever the value is 1. I am trying the following method (I am not sure whether it is optimal, or even the right way), but I get an error:

y=[row.index(1.0) for index, row in enumerate(unknown_tfidf) if int(1.0) in row]

The error is:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

My question is: how can I get all the y-indices of this matrix where the value is 1?

Upvotes: 0

Views: 1792

Answers (2)

hpaulj

Reputation: 231385

Your list comprehension works for a nested list:

In [100]: xl=[[0,1,3],[0,0,1],[1,1,0]]
In [101]: [row.index(1) for index, row in enumerate(xl) if 1 in row]
Out[101]: [1, 2, 0]

(Note that index returns just the first match in the third row.)

But it does not work for a numpy.array:

In [102]: xa=np.array(xl)
In [103]: [row.index(1) for index, row in enumerate(xa) if 1 in row]
...
AttributeError: 'numpy.ndarray' object has no attribute 'index'

Nor does it work for a sparse matrix:

In [104]: xs=sparse.csr_matrix(xl)
In [105]: xs
Out[105]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in Compressed Sparse Row format>
In [106]: [row.index(1) for index, row in enumerate(xs) if 1 in row]
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

If I remove the if test I get a different error, a variation on the dense array error:

In [108]: [row.index(1) for index, row in enumerate(xs)]
...
AttributeError: index not found

Look at what enumerate gives us to work with:

In [109]: [(index,row) for index, row in enumerate(xs)]
Out[109]: 
[(0, <1x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>),
 (1, <1x3 sparse matrix of type '<class 'numpy.int32'>'
    with 1 stored elements in Compressed Sparse Row format>),
 (2, <1x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>)]

row is another sparse matrix, the same as xs[0], etc. So the 1 in row and row.index(1) expressions have to work with an array or sparse matrix; otherwise you get an error.

We've already seen that neither has an index method. That is a list method; you have to use something else for arrays or sparse matrices. Your comprehension has the if clause because list.index raises an error if the item is not found. In that sense the if ... in test and index go together.

in works for an array, but raises a ValueError for the sparse matrix:

In [114]: 1 in xa[0]
Out[114]: True
In [115]: 1 in xs[0]
....
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

More commonly this ValueError is produced by the equivalent of:

In [117]: if np.array([True, False, True]):'yes'
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

That is, giving if a boolean array. In your case the failure occurs within the sparse code. In effect, in has not been implemented for sparse matrices.
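Since in is not usable here, a membership test can be built from the equality comparison instead. The following is only a minimal sketch: row_contains is a hypothetical helper name (not part of scipy), and it assumes the value being searched for is nonzero.

```python
from scipy import sparse

xs = sparse.csr_matrix([[0, 1, 3], [0, 0, 1], [1, 1, 0]])

def row_contains(row, value):
    """Hypothetical helper: True if the sparse row holds `value`.

    Assumes `value` is nonzero, so the comparison result stays sparse.
    """
    return bool((row == value).count_nonzero() > 0)

# Index the rows explicitly; each xs[i] is a 1x3 sparse matrix.
has_one = [row_contains(xs[i], 1) for i in range(xs.shape[0])]
print(has_one)
```

This replaces only the if 1 in row part of the comprehension; finding the position of the match still needs something other than index.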

So if you insist on using this list comprehension approach, you'll have to turn your sparse matrix into a list of lists:

In [120]: [row.index(1) for index, row in enumerate(xs.toarray().tolist()) if 1 in row]
Out[120]: [1, 2, 0]

Here's a variation on unutbu's answer:

Use a matrix/array equality test to find ALL the elements that match:

In [121]: xs==1
Out[121]: 
<3x3 sparse matrix of type '<class 'numpy.bool_'>'
    with 4 stored elements in Compressed Sparse Row format>
In [122]: (xs==1).A
Out[122]: 
array([[False,  True, False],
       [False, False,  True],
       [ True,  True, False]], dtype=bool)

Then use a builtin method to get the indices of those True elements:

In [123]: (xs==1).nonzero()
Out[123]: (array([0, 1, 2, 2], dtype=int32), array([1, 2, 0, 1], dtype=int32))

The second element of that tuple is the list you want (with 2 values for the 3rd row).

Or, to collect the values row by row (remember, when iterating, each row is itself a matrix):

In [125]: [i.nonzero() for i in (xs==1)]
Out[125]: 
[(array([0], dtype=int32), array([1], dtype=int32)),
 (array([0], dtype=int32), array([2], dtype=int32)),
 (array([0, 0], dtype=int32), array([0, 1], dtype=int32))]

Reducing that to a simple list of indices takes more fiddling:

In [131]: [i.nonzero()[1].tolist() for i in (xs==1)]
Out[131]: [[1], [2], [0, 1]]
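The same per-row lists can probably be had without iterating over the matrix at all, by splitting the indices array of the comparison result at the row boundaries stored in indptr. This is a sketch under the assumption that the comparison result is in CSR format; the names mask and cols_per_row are illustrative only.

```python
import numpy as np
from scipy import sparse

xs = sparse.csr_matrix([[0, 1, 3], [0, 0, 1], [1, 1, 0]])

mask = (xs == 1).tocsr()
mask.eliminate_zeros()  # keep only the True entries
mask.sort_indices()     # column indices in ascending order within each row

# indptr[i]:indptr[i+1] delimits row i inside mask.indices,
# so the interior indptr values are the split points.
cols_per_row = [part.tolist() for part in np.split(mask.indices, mask.indptr[1:-1])]
print(cols_per_row)  # [[1], [2], [0, 1]]
```

A row with no matching value yields an empty list, so the result always has one entry per row.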

Upvotes: 1

unutbu

Reputation: 879859

The column indices where the Compressed Sparse Row (CSR) matrix equals 1 are stored in the .indices attribute of the comparison result:

import numpy as np
import scipy.sparse as sparse
np.random.seed(2016)

arr = np.round(10*sparse.rand(10, 10, density=0.8, format='csr'))
# arr.A
# array([[  5.,   0.,   7.,   7.,   8.,   7.,   0.,   2.,   4.,   2.],
#        [  4.,   0.,   9.,   2.,   4.,   8.,   4.,   2.,   5.,   9.],
#        [  7.,   4.,   4.,   2.,   4.,   0.,   0.,   0.,   6.,   0.],
#        [  8.,   0.,   0.,   7.,   0.,   6.,   5.,   8.,   0.,   3.],
#        [  3.,   5.,   1.,   0.,   0.,   7.,   3.,   8.,   3.,   0.],
#        [  8.,   6.,   7.,   0.,   8.,   2.,   7.,   0.,   1.,   1.],
#        [  4.,   6.,   3.,   1.,   8.,   7.,   8.,   6.,   0.,   2.],
#        [  7.,   7.,   0.,  10.,   6.,   2.,   4.,   2.,   1.,  10.],
#        [ 10.,   0.,   4.,   8.,   1.,   1.,   3.,   1.,   9.,   1.],
#        [  0.,   4.,   0.,   0.,   7.,   2.,  10.,   1.,   9.,   0.]])

condition = (arr == 1)
print(condition.indices)

yields

[2 8 9 3 8 4 5 7 9 7]

The fastest way to find both the row and column indices where arr equals 1 is to convert the comparison result to a COO matrix, then read off its row and col attributes:

coo = condition.tocoo()
print(coo.row)
print(coo.col)

yields

[4 5 5 6 7 8 8 8 8 9]
[2 8 9 3 8 4 5 7 9 7]
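If the stored values are floats that might not be exactly 1.0 (an assumption about the data, not something stated in the question), one possible variation is to filter the COO data array with a tolerance instead of using ==. A small sketch, with hit, rows and cols as illustrative names and np.isclose's default tolerance as an arbitrary choice:

```python
import numpy as np
from scipy import sparse

# Small float matrix standing in for the real data.
xs = sparse.csr_matrix(np.array([[0, 1, 3], [0, 0, 1], [1, 1, 0]], dtype=float))

coo = xs.tocoo()
hit = np.isclose(coo.data, 1.0)  # tolerance-based match on the stored values only

rows = coo.row[hit]
cols = coo.col[hit]
print(rows)
print(cols)
```

Only the stored (nonzero) entries are examined, so this works for any nonzero target value.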

Upvotes: 2
