slaw
slaw

Reputation: 6869

Check Density of a Scipy Sparse Matrix

Is there a good way to measure or check the density of a scipy.sparse matrix?

For example:

import scipy.sparse
import numpy as np

row  = np.array([0,3,1,0])
col  = np.array([0,3,1,2])
data = np.array([4,5,7,9])

mat = scipy.sparse.coo_matrix((data,(row,col)), shape=(4,4))
print mat.todense()

[[4 0 9 0]
 [0 7 0 0]
 [0 0 0 0]
 [0 0 0 5]]

Maybe return something that gives me the general statistics for the overall density like average occupancy per row (i.e., first row occupies 2/4 values, second row occupies 1/4, third row occupies 0/4, and fourth row occupies 1/4 and so the average occupancy/density would be 1/4), stddev, variance, etc. Perhaps there's a better density metric that one could apply that is not dependent on the size of the matrix (assuming that it is sufficiently large).

Upvotes: 4

Views: 8283

Answers (4)

Leo
Leo

Reputation: 319

You can count all elements in mat as all=sum(mat.count()) Next, you can count all zeros as zeros=all-count_nonzero(mat) From these values you estimate density as density=zeros/all

Upvotes: 0

danodonovan
danodonovan

Reputation: 20341

To get a simple density score for mat (ie the fraction of non-zero elements in the matrix) I use something like;

density = mat.getnnz() / np.prod(mat.shape)

Upvotes: 9

David Maust
David Maust

Reputation: 8270

One approach is to use the getnnz() method to identify the number of non-zero items in a given row, column or the matrix as a whole.

Let's start with an example sparse matrix sp_mat.

sp_mat.todense()

matrix([[0, 1, 1, 1, 1],
        [1, 0, 1, 0, 0]])

Non-zero element counts in the whole matrix:

sp_mat.getnnz()
# 6

Non-zero element counts in a given row:

sp_mat[0,:].getnnz()
# 4

Non-zero element counts for all rows:

sp_mat.getnnz(axis=1)
# array([4, 2], dtype=int32)

Non-zero element counts in a column:

sp_mat[:,1].getnnz()
# 1

Non-zero element counts for all columns:

sp_mat.getnnz(axis=0)
#  array([1, 1, 2, 1, 1])

This could be compared with the shape of the matrix to compute a density:

sp_mat.shape
# (2, 5)

Upvotes: 11

hpaulj
hpaulj

Reputation: 231385

I'm not aware of any such density function, but you could search the sparse documentation.

It is easy to get the number of nonzero elements, for the whole array, and by iteration for each row.

mat.nnz
Out[55]: 4

[i.nnz for i in mat.tolil()]
Out[57]: [2, 1, 0, 1]

I used tolil because coo does not allow row iteration (or indexing). csr would also work.

You could also use the attributes of the lil format directly, since they are lists of lists. This is quite a bit faster than iterating on the rows of the lil format. That operation creates a new sparse matrix at each iteration, which is over all a slow operation.

mal=mat.tolil()

mal.data
Out[65]: array([[4, 9], [7], [], [5]], dtype=object)

mal.rows
Out[67]: array([[0, 2], [1], [], [3]], dtype=object)

[len(i) for i in mal.rows]
Out[68]: [2, 1, 0, 1]

Turn this into a array, and calculate all the statistics you want:

In [76]: s=np.array([len(i) for i in mal.rows])

In [77]: np.mean(s/4.)
Out[77]: 0.25

In [78]: np.std(s/4.)
Out[78]: 0.17677669529663689

It may be faster to apply this row count to the dense array

In [93]: timeit [np.count_nonzero(i) for i in mat.A]
10000 loops, best of 3: 44.3 µs per loop

In [94]: timeit [i.nnz for i in mat.tolil()]
100 loops, best of 3: 2.67 ms per loop

I just realized that with the dense version, at least, you can get the nonzero count without iteration - sum a boolean:

In [6]: (mat.A!=0).sum(axis=1)
Out[6]: array([2, 1, 0, 1])

(though for this small sample array this is slower than the other dense version).

The sparse version also works, but is slower (but faster than the iterative sparse). Mostly it's the boolean test; the row summation is done with a matrix multiply.

In [9]: (mat!=0).sum(axis=1)
Out[9]: 
matrix([[2],
        [1],
        [0],
        [1]])

This is a faster sparse summation method:

In [13]: mat1=mat.tocsr(); mat1.data[:]=1;mat1.sum(axis=1)
Out[13]: 
matrix([[2],
        [1],
        [0],
        [1]])

tocsr makes a copy; we change its data to all ones, and sum them.

So if speed's important, you'll need to do your own time tests with a realistic size matrix.

Upvotes: 0

Related Questions