Reputation: 617
I'm implementing an efficient PageRank algorithm so I'm using sparse matrices. I'm close, but there's one problem. I have a matrix where I want the sum of each column to be one. This is easy to implement, but the problem occurs when I get a matrix with a zero column.
In this case, I want to set each element in the column to be 1/(n-1) where n is the dimension of the matrix. I divide by n-1 and not n because I wish to keep the diagonals zero, always.
How can I implement this efficiently? My naive solution is to just determine the sum of each column and then find the column indices that are zero and replace the entire column with an 1/(n-1) value like so:
# naive approach (too slow!)
# M is my nxn sparse matrix where each column sums to one
col_sums = M.sum(axis=0)
for i in range(n):
if col_sums[0,i] == 0:
# set entire column to 1/(n-1)
M[:, i] = 1/(n-1)
# make sure diagonal is zeroed
M[i,i] = 0
My M matrix is very very very large and this method simply doesn't scale. How can I do this efficiently?
Upvotes: 0
Views: 165
Reputation: 3985
You can't add new nonzero values without reallocating and copying the underlying data structure. If you expect these zero columns to be very common (> 25% of the data) you should handle them in some other way, or you're better off with a dense array.
Otherwise try this:
import scipy.sparse
M = scipy.sparse.rand(1000, 1000, density=0.001, format='csr')
nz_col_weights = scipy.sparse.csr_matrix(M.shape, dtype=M.dtype)
nz_col_weights[:, M.getnnz(axis=0) == 0] = 1 / (M.shape[0] - 1)
nz_col_weights.setdiag(0)
M += nz_col_weights
This has only two allocation operations
Upvotes: 1