I needed a custom Zipf-like number generator because the numpy.random.zipf function doesn't achieve what I need. Firstly, its alpha must be greater than 1.0 and I need an alpha of 0.5. Secondly, its cardinality is directly related to the sample size, and I need to make more samples than the cardinality, e.g. make a list of 1000 elements from a Zipfian distribution of only 6 unique values.
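For instance, a quick illustration of the first restriction (a sketch; the exact error message may vary across numpy versions):

import numpy as np

# numpy.random.zipf only accepts a > 1, so alpha = 0.5 is rejected:
try:
    np.random.zipf(a=0.5, size=10)
except ValueError as e:
    print(e)  # e.g. "a <= 1.0"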
@stanga posted a great solution to this.
import random
import bisect
import math

class ZipfGenerator:

    def __init__(self, n, alpha):
        # Calculate Zeta values from 1 to n:
        tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)]
        zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])
        # Store the translation map:
        self.distMap = [x / zeta[-1] for x in zeta]

    def next(self):
        # Take a uniform 0-1 pseudo-random value:
        u = random.random()
        # Translate the Zipf variable:
        return bisect.bisect(self.distMap, u) - 1
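For reference, a minimal usage sketch (Python 2, since the class relies on the reduce builtin and a next method), matching the 6-value, 1000-sample example above:

gen = ZipfGenerator(n=6, alpha=0.5)
samples = [gen.next() for _ in range(1000)]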
The alpha can be less than 1.0, and the sampling can be infinite for a fixed cardinality n. The problem is that it runs too slowly.
# Calculate Zeta values from 1 to n:
tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)]
zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])
These two lines are the culprits. When I choose n=50000 I can generate my list in ~10 seconds. I need to execute this with n=5000000, but that's not feasible. I don't fully understand why this performs so slowly, because (I think) it has linear complexity and the floating point operations seem simple. I am using Python 2.6.6 on a good server.
Is there an optimization I can make, or a different solution altogether that meets my requirements?
EDIT: I'm updating my question with a possible solution using modifications recommended by @ev-br. I've simplified it into a subroutine that returns the entire list. @ev-br was correct to suggest replacing bisect with searchsorted, as the former proved to be a bottleneck as well.
import numpy

def randZipf(n, alpha, numSamples):
    # Calculate Zeta values from 1 to n:
    tmp = numpy.power(numpy.arange(1, n+1), -alpha)
    zeta = numpy.r_[0.0, numpy.cumsum(tmp)]
    # Store the translation map:
    distMap = [x / zeta[-1] for x in zeta]
    # Generate an array of uniform 0-1 pseudo-random values:
    u = numpy.random.random(numSamples)
    # bisect them with distMap
    v = numpy.searchsorted(distMap, u)
    samples = [t-1 for t in v]
    return samples
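Called with the example numbers from above, for instance:

samples = randZipf(6, 0.5, 1000)  # 1000 samples over 6 unique values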
Let me take a small example first.
In [1]: import numpy as np
In [2]: import math
In [3]: alpha = 0.1
In [4]: n = 5
In [5]: tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)]
In [6]: zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])
In [7]: tmp
Out[7]:
[1.0,
0.9330329915368074,
0.8959584598407623,
0.8705505632961241,
0.8513399225207846]
In [8]: zeta
Out[8]:
[0,
1.0,
1.9330329915368074,
2.82899145137757,
3.699542014673694,
4.550881937194479]
Now, let's try to vectorize it, starting from the innermost operations. The reduce call is essentially a cumulative sum:
In [9]: np.cumsum(tmp)
Out[9]: array([ 1. , 1.93303299, 2.82899145, 3.69954201, 4.55088194])
You want a leading zero, so let's prepend it:
In [11]: np.r_[0., np.cumsum(tmp)]
Out[11]:
array([ 0. , 1. , 1.93303299, 2.82899145, 3.69954201,
4.55088194])
Your tmp array can be constructed in one go as well:
In [12]: tmp_vec = np.power(np.arange(1, n+1) , -alpha)
In [13]: tmp_vec
Out[13]: array([ 1. , 0.93303299, 0.89595846, 0.87055056, 0.85133992])
Now, some quick-and-dirty timings:
In [14]: %%timeit
....: n = 1000
....: tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)]
....: zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])
....:
100 loops, best of 3: 3.16 ms per loop
In [15]: %%timeit
....: n = 1000
....: tmp_vec = np.power(np.arange(1, n+1) , -alpha)
....: zeta_vec = np.r_[0., np.cumsum(tmp_vec)]
....:
10000 loops, best of 3: 101 µs per loop
Now, it gets better with increasing n:
In [18]: %%timeit
n = 50000
tmp_vec = np.power(np.arange(1, n+1) , -alpha)
zeta_vec = np.r_[0., np.cumsum(tmp_vec)]
....:
100 loops, best of 3: 3.26 ms per loop
As compared to
In [19]: %%timeit
n = 50000
tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)]
zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])
....:
1 loops, best of 3: 7.01 s per loop
(Incidentally, this is why the pure-Python version blows up so badly: each step of the reduce builds a brand-new list via sums + [...], so constructing zeta is O(n²) rather than linear.) Down the line, the call to bisect can be replaced by np.searchsorted.
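For instance, a sketch of that swap using the zeta_vec built above (the -1 mirrors the class's next method; the names come from this session, not a fixed API):

distMap = zeta_vec / zeta_vec[-1]     # normalize, as in the class
u = np.random.random(1000)            # 1000 uniform draws at once
v = np.searchsorted(distMap, u) - 1   # all samples in a single call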
EDIT: A couple of comments which are not directly relevant to the original question, and are rather based on my guesses of what can trip you down the line:

- You can seed the generator via np.random.seed, but better make the seed an explicit argument defaulting to None (meaning do not seed it).
- samples = [t-1 for t in v] is not needed; just return v-1.
- What you are building is essentially what scipy.stats.rv_discrete is doing. If you only need sampling, you're fine. If you need a full-fledged distribution, you may look into using it. (See the sketches after this list.)
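Putting the first two points together, a hypothetical revision of the subroutine (the name, the seed argument, and the RandomState usage are illustrative additions, not part of the original post):

import numpy

def randZipfSeeded(n, alpha, numSamples, seed=None):
    # Explicit seed argument defaulting to None, i.e. do not seed:
    rng = numpy.random.RandomState(seed)
    tmp = numpy.power(numpy.arange(1, n+1), -alpha)
    zeta = numpy.r_[0.0, numpy.cumsum(tmp)]
    distMap = zeta / zeta[-1]           # vectorized normalization
    u = rng.random_sample(numSamples)
    return numpy.searchsorted(distMap, u) - 1   # return v - 1 directly

And a minimal sketch of the scipy.stats.rv_discrete route, assuming scipy is available:

import numpy as np
from scipy import stats

n, alpha = 6, 0.5
pk = np.power(np.arange(1, n+1), -alpha)
pk /= pk.sum()                          # probabilities must sum to 1
dist = stats.rv_discrete(name='zipf_like', values=(np.arange(n), pk))
samples = dist.rvs(size=1000)           # pmf, cdf, etc. come for free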