user1131274

Reputation: 473

Slow random sample generation without replacement in scipy

I am trying to create a sparse matrix representation of a random hash map h: [n] -> [d] which maps each i to exactly s random locations out of d available locations, with the values at those locations drawn from some discrete distribution.

:param d: number of bins
:param n: number of items hashed
:param s: sparsity of each column
:param distribution: distribution object. 

Here is my attempt:

# imports needed (at module level)
import time
from math import sqrt
import numpy
import scipy.stats
import scipy.sparse

start_time = time.time()

# values are -1/+1 with equal probability
distribution = scipy.stats.rv_discrete(values=([-1.0, +1.0], [0.5, 0.5]), name='dist')

# one scaled random value per nonzero entry
data = (1.0/sqrt(self._s))*distribution.rvs(size=self._n*self._s)

# column indices: each column index i repeated s times
col = numpy.empty(self._s*self._n)
for i in range(self._n):
  col[i*self._s:(i+1)*self._s] = i

row = numpy.empty(self._s*self._n)

print(time.time()-start_time)

# row indices: s distinct bins per column, drawn without replacement
for i in range(self._n):
  row[i*self._s:(i+1)*self._s] = numpy.random.choice(self._d, self._s, replace=False)

S = scipy.sparse.csr_matrix((data, (row, col)), shape=(self._d, self._n))

print(time.time()-start_time)

return S

Now, creating this map for n=500000, s=10, d=1000 takes around 20 s on my decent workstation, and about 90% of that time is spent generating the row indices. Is there anything I can do to speed this up? Any alternatives? Thanks.

Upvotes: 0

Views: 237

Answers (1)

hpaulj

Reputation: 231605

col = numpy.empty(self._s*self._n)
for i in range(self._n):
  col[i*self._s:(i+1)*self._s]=i

looks like something that could be written as one non-looping expression, though it probably isn't a big time consumer.

My first guess (I'd need to play with this to be sure) is to broadcast the column index across a 2d array and ravel it:

import numpy as np

col = np.empty((self._n, self._s), dtype=int)
col[:, :] = np.arange(self._n)[:, None]   # broadcast column index i across its s slots
col = col.ravel()                          # i repeated s times, matching the loop's order
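(A non-looping one-liner that I believe produces the same ordering as the loop, though this is my note rather than something tested here: col = np.repeat(np.arange(self._n), self._s).)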

Something similar for:

for i in range(self._n):
    row[i*self._s:(i+1)*self._s]=numpy.random.choice(self._d, self._s, replace=False)

is, I think, picking _s values from _d, done _n times over. Doing the no-replace draw along _s while still allowing repeats across the _n columns could be tricky to vectorize (though there's a sketch of one possibility below).

Without running the code myself (with a smaller n), I'm stumbling around a bit. Which is the slow part: generating col, generating row, or the final csr? Iteration over n=500000 is going to be slow.

The matrix will be (1000, 500000), but with 10*500000 nonzero items, so a density of .01. Just for comparison, it would be interesting to generate a sparse random matrix of similar size and density:

In [5]: %timeit sparse.random(1000, 500000, .01)
1 loop, best of 3: 24.6 s per loop

and the dense random choices:

In [8]: timeit np.random.choice(1000,(10,500000)).shape
10 loops, best of 3: 53 ms per loop
In [9]: np.array([np.random.choice(1000,(10,)) for i in range(500000)]).shape
Out[9]: (500000, 10)
In [10]: timeit np.array([np.random.choice(1000,(10,)) for i in range(500000)]).shape
1 loop, best of 3: 12.7 s per loop

So, yes, the large iteration loop is expensive. But given the replacement policy there might not be a way around that. Or is there?
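One vectorized possibility, not from the original answer but a sketch under the assumption that all we need is s distinct bins per column: the row-wise ranks of an i.i.d. uniform matrix form a uniform random permutation, so the indices of the s smallest entries in each row are s draws without replacement. The full (n, d) random matrix would be ~4 GB in float64 at these sizes, so chunking over n keeps memory bounded; n, d, s and chunk below are placeholders for the question's parameters:

import numpy as np

n, d, s = 500000, 1000, 10   # sizes from the question
chunk = 50000                # chunk over n to bound memory (~400 MB per chunk)

row = np.empty((n, s), dtype=np.intp)
for start in range(0, n, chunk):
    stop = min(start + chunk, n)
    r = np.random.rand(stop - start, d)
    # indices of the s smallest uniforms per row = s distinct bins,
    # each s-subset equally likely; order within the s is arbitrary
    row[start:stop] = np.argpartition(r, s, axis=1)[:, :s]
row = row.ravel()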

So as a first guess, creating row takes half the time and creating the sparse matrix the other half. I'm not surprised. You are using the coo style of input, which requires lexsorting and summing of duplicates when converting to csr. We might be able to gain speed by using the indptr style of input: there won't be any duplicates to sum, and since there are consistently 10 nonzero terms per column (per row would be the transpose), generating the indptr values won't be hard. But I can't do that off the top of my head.
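A sketch of what that indptr-style construction might look like (my assumption; it isn't worked out in the original answer): the matrix is (d, n) with exactly s nonzeros per column, so indptr is just an arithmetic progression and a csc_matrix can be built directly, skipping the coo lexsort/sum-duplicates step:

import numpy as np
import scipy.sparse

n, d, s = 500000, 1000, 10

data = np.random.choice([-1.0, 1.0], n*s) / np.sqrt(s)

# any scheme giving s distinct bins per column works here; only the
# matrix construction changes (the question's loop shown for clarity)
indices = np.empty(n*s, dtype=np.intp)
for i in range(n):
    indices[i*s:(i+1)*s] = np.random.choice(d, s, replace=False)

indptr = np.arange(0, n*s + 1, s)   # column i owns entries [i*s, (i+1)*s)
S = scipy.sparse.csc_matrix((data, indices, indptr), shape=(d, n))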

random sparse to csr is just a bit slower:

In [11]: %timeit sparse.random(1000, 500000, .01, 'csr')
1 loop, best of 3: 28.3 s per loop

Upvotes: 1
