Reputation: 6869
Is there a good way to measure or check the density of a scipy.sparse matrix?
For example:
import scipy.sparse
import numpy as np
row = np.array([0,3,1,0])
col = np.array([0,3,1,2])
data = np.array([4,5,7,9])
mat = scipy.sparse.coo_matrix((data,(row,col)), shape=(4,4))
print mat.todense()
[[4 0 9 0]
[0 7 0 0]
[0 0 0 0]
[0 0 0 5]]
Maybe return something that gives me the general statistics for the overall density like average occupancy per row (i.e., first row occupies 2/4 values, second row occupies 1/4, third row occupies 0/4, and fourth row occupies 1/4 and so the average occupancy/density would be 1/4), stddev, variance, etc. Perhaps there's a better density metric that one could apply that is not dependent on the size of the matrix (assuming that it is sufficiently large).
Upvotes: 4
Views: 8283
Reputation: 319
You can count all elements in mat as
all=sum(mat.count())
Next, you can count all zeros as
zeros=all-count_nonzero(mat)
From these values you estimate density as
density=zeros/all
Upvotes: 0
Reputation: 20341
To get a simple density score for mat
(ie the fraction of non-zero elements in the matrix) I use something like;
density = mat.getnnz() / np.prod(mat.shape)
Upvotes: 9
Reputation: 8270
One approach is to use the getnnz()
method to identify the number of non-zero items in a given row, column or the matrix as a whole.
Let's start with an example sparse matrix sp_mat
.
sp_mat.todense()
matrix([[0, 1, 1, 1, 1],
[1, 0, 1, 0, 0]])
Non-zero element counts in the whole matrix:
sp_mat.getnnz()
# 6
Non-zero element counts in a given row:
sp_mat[0,:].getnnz()
# 4
Non-zero element counts for all rows:
sp_mat.getnnz(axis=1)
# array([4, 2], dtype=int32)
Non-zero element counts in a column:
sp_mat[:,1].getnnz()
# 1
Non-zero element counts for all columns:
sp_mat.getnnz(axis=0)
# array([1, 1, 2, 1, 1])
This could be compared with the shape of the matrix to compute a density:
sp_mat.shape
# (2, 5)
Upvotes: 11
Reputation: 231385
I'm not aware of any such density function, but you could search the sparse
documentation.
It is easy to get the number of nonzero elements, for the whole array, and by iteration for each row.
mat.nnz
Out[55]: 4
[i.nnz for i in mat.tolil()]
Out[57]: [2, 1, 0, 1]
I used tolil
because coo
does not allow row iteration (or indexing). csr
would also work.
You could also use the attributes of the lil
format directly, since they are lists of lists. This is quite a bit faster than iterating on the rows of the lil
format. That operation creates a new sparse matrix at each iteration, which is over all a slow operation.
mal=mat.tolil()
mal.data
Out[65]: array([[4, 9], [7], [], [5]], dtype=object)
mal.rows
Out[67]: array([[0, 2], [1], [], [3]], dtype=object)
[len(i) for i in mal.rows]
Out[68]: [2, 1, 0, 1]
Turn this into a array, and calculate all the statistics you want:
In [76]: s=np.array([len(i) for i in mal.rows])
In [77]: np.mean(s/4.)
Out[77]: 0.25
In [78]: np.std(s/4.)
Out[78]: 0.17677669529663689
It may be faster to apply this row count to the dense array
In [93]: timeit [np.count_nonzero(i) for i in mat.A]
10000 loops, best of 3: 44.3 µs per loop
In [94]: timeit [i.nnz for i in mat.tolil()]
100 loops, best of 3: 2.67 ms per loop
I just realized that with the dense version, at least, you can get the nonzero count without iteration - sum a boolean:
In [6]: (mat.A!=0).sum(axis=1)
Out[6]: array([2, 1, 0, 1])
(though for this small sample array this is slower than the other dense version).
The sparse version also works, but is slower (but faster than the iterative sparse). Mostly it's the boolean test; the row summation is done with a matrix multiply.
In [9]: (mat!=0).sum(axis=1)
Out[9]:
matrix([[2],
[1],
[0],
[1]])
This is a faster sparse summation method:
In [13]: mat1=mat.tocsr(); mat1.data[:]=1;mat1.sum(axis=1)
Out[13]:
matrix([[2],
[1],
[0],
[1]])
tocsr
makes a copy; we change its data
to all ones, and sum them.
So if speed's important, you'll need to do your own time tests with a realistic size matrix.
Upvotes: 0