Reputation: 473
I am trying to create sparse matrix representation of a random hash map h:[n] -> [t] which maps each i to exactly s random location of available d locations and the value at those location are drawn from some discrete distribution.
:param d: number of bins
:param n: number of items hashed
:param s: sparsity of each column
:param distribution: distribution object.
Here is my attempt:
start_time=time.time()
distribution = scipy.stats.rv_discrete(values=([-1.0, +1.0 ], [0.5, 0.5]),name = 'dist')
data = (1.0/sqrt(self._s))*distribution.rvs(size=self._n*self._s)
col = numpy.empty(self._s*self._n)
for i in range(self._n):
col[i*self._s:(i+1)*self._s]=i
row = numpy.empty(self._s*self._n)
print time.time()-start_time
for i in range(self._n):
row[i*self._s:(i+1)*self._s]=numpy.random.choice(self._d, self._s, replace=False)
S = scipy.sparse.csr_matrix( (data, (row, col)), shape = (self._d,self._n))
print time.time()-start_time
return S
Now for creating this map for n=500000, s=10,d=1000, it is taking me around 20s on my decent workstation, in which 90% of time is consumed in generating row indices. Is there anything I can do to speed this up? Any alternatives? Thanks.
Upvotes: 0
Views: 237
Reputation: 231605
col = numpy.empty(self._s*self._n)
for i in range(self._n):
col[i*self._s:(i+1)*self._s]=i
looks like something that could be written as one non-looping expression; though it probably isn't a big time consumer
My first guess is - but I need to play with this to be sure; I think it's assigning all rows the column index number.
col = np.empty(self._s, self._n)
col[:,:] = np.arange(self._n)
col = col.ravel()
Something similar for:
for i in range(self._n):
row[i*self._s:(i+1)*self._s]=numpy.random.choice(self._d, self._s, replace=False)
is, I think, picking _s
values from _d
_n
times. Doing the no-replace along _s, but allowing replace on _n
could be tricky.
Without running the code myself (with smaller n
) I'm stumbling around a bit. Which is the slow part, generating col
, row
, or the final csr
? Iteration on n=500000
is going to be slow.
The matrix will be (1000, 500000), but with (10*500000) nonzero items. So a sparsity of .01. Just for comparison it would be interesting to generate a sparse random matrix of similar size and sparsity
In [5]: %timeit sparse.random(1000, 500000, .01)
1 loop, best of 3: 24.6 s per loop
and the dense random choices:
In [8]: timeit np.random.choice(1000,(10,500000)).shape
10 loops, best of 3: 53 ms per loop
In [9]: np.array([np.random.choice(1000,(10,)) for i in range(500000)]).shape
Out[9]: (500000, 10)
In [10]: timeit np.array([np.random.choice(1000,(10,)) for i in range(500000)]).
...: shape
1 loop, best of 3: 12.7 s per loop
So, yes, the large iteration loop is expensive. But given the replacement policy there might not be a way around that. Or is there?
So as first guess, creating row
takes half the time, creating the sparse matrix the other half. I'm not surprised. You are using the coo
style of input, which requires lexsorting
and summing for duplicates when converting to csr
. We might be able to gain speed by using the indptr
type of input. There won't be duplicates to sum. And since there are consistently 10 nonzero terms per row, generating the indptr
values won't be hard. But I can't do that off the top of my head. (oops, that's the transpose).
random
sparse to csr
is just a bit slower:
In [11]: %timeit sparse.random(1000, 500000, .01, 'csr')
1 loop, best of 3: 28.3 s per loop
Upvotes: 1