Summarize non-zero values in a scipy matrix by axis

Question

I have this code to summarize each row of a scipy sparse csr matrix:

count_list = dtm.toarray().sum(axis=1)

How can I instead summarize each row as if each non-zero value were 1? I could replace all values > 0 with 1 and then use the same code as above. I could also iterate over each row in the matrix and use NumPy's count_nonzero:

count_list = [np.count_nonzero(v) for row in dtm for v in row.toarray()]

Is there an easier or more straightforward way (similar to the method in the first example)?

fuglede · Accepted Answer

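Assuming that the matrix stores no explicit zeros, the number of non-zero values in each row is the difference between consecutive entries of indptr: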

count_list = dtm.indptr[1:] - dtm.indptr[:-1]

For example:

In [34]: dtm = scipy.sparse.random(1000, 1000, format='csr')

In [35]: count_list_np = [np.count_nonzero(v) for row in dtm for v in row.toarray()]

In [36]: count_list = dtm.indptr[1:] - dtm.indptr[:-1]

In [37]: np.array_equal(count_list, count_list_np)
Out[37]: True
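
To see why this works, it may help to look at indptr on a matrix small enough to check by hand. The sketch below uses a made-up 3x4 matrix (the values are illustrative, not from the question) and np.diff, which computes the same consecutive differences as the slicing expression above:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small hand-made matrix (hypothetical data, chosen so the counts are easy to verify).
dense = np.array([
    [1, 0, 2, 0],
    [0, 0, 0, 0],
    [3, 4, 0, 5],
])
dtm = csr_matrix(dense)

# In CSR format, the stored values of row i live in data[indptr[i]:indptr[i + 1]],
# so the difference of consecutive indptr entries is the number of stored values per row.
row_counts = np.diff(dtm.indptr)

print(dtm.indptr)   # [0 2 2 5]
print(row_counts)   # [2 0 3]
```

Because the result is computed from indptr alone, no dense copy of the matrix is created.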

If you do have explicit zeros, remove them first with eliminate_zeros, which modifies the matrix in place:

dtm.eliminate_zeros()
count_list = dtm.indptr[1:] - dtm.indptr[:-1]
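
As a rough illustration of the difference explicit zeros make, the sketch below builds a small matrix with one stored zero (the data is hypothetical) and compares the counts before and after eliminate_zeros. It works on a copy so the original matrix is left untouched. The last line is an assumption about what "by axis" might also mean: per-column counts, obtained here by counting how often each column index occurs.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical CSR matrix built from raw arrays; row 0 stores an explicit 0.0 in column 1.
data = np.array([1.0, 0.0, 2.0, 3.0])
indices = np.array([0, 1, 2, 1])
indptr = np.array([0, 3, 3, 4])
m = csr_matrix((data, indices, indptr), shape=(3, 4))

# Counts every stored entry, including the explicit zero.
stored_per_row = np.diff(m.indptr)

# eliminate_zeros works in place, so operate on a copy to keep m unchanged.
cleaned = m.copy()
cleaned.eliminate_zeros()
nonzero_per_row = np.diff(cleaned.indptr)

# Per-column counts: each entry of indices is the column of one stored value.
nonzero_per_col = np.bincount(cleaned.indices, minlength=cleaned.shape[1])

print(stored_per_row)    # [3 0 1]
print(nonzero_per_row)   # [2 0 1]
print(nonzero_per_col)   # [1 1 1 0]
```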
