Slow random sample generation without replacement in scipy

Question

I am trying to create sparse matrix representation of a random hash map h:[n] -> [t] which maps each i to exactly s random location of available d locations and the value at those location are drawn from some discrete distribution.

:param d: number of bins
:param n: number of items hashed
:param s: sparsity of each column
:param distribution: distribution object.

Here is my attempt:

start_time=time.time()
distribution = scipy.stats.rv_discrete(values=([-1.0, +1.0  ], [0.5, 0.5]),name = 'dist')

data = (1.0/sqrt(self._s))*distribution.rvs(size=self._n*self._s)
col = numpy.empty(self._s*self._n)
for i in range(self._n):
  col[i*self._s:(i+1)*self._s]=i

row = numpy.empty(self._s*self._n)

print time.time()-start_time

for i in range(self._n):
  row[i*self._s:(i+1)*self._s]=numpy.random.choice(self._d, self._s, replace=False)

S = scipy.sparse.csr_matrix( (data, (row, col)), shape = (self._d,self._n))

print time.time()-start_time

return S

Now for creating this map for n=500000, s=10,d=1000, it is taking me around 20s on my decent workstation, in which 90% of time is consumed in generating row indices. Is there anything I can do to speed this up? Any alternatives? Thanks.

Slow random sample generation without replacement in scipy

Answers (1)

Related Questions