Create a distribution from a list and generate random numbers that follow that distribution in Python

Question

Let's say I have a list of numbers (all between 0.5 and 1.5 in this particular example, and of course it is a discrete set).

my_list=  [0.564, 1.058, 0.779, 1.281, 0.656, 0.863, 0.958, 1.146, 0.742, 1.139, 0.957, 0.548, 0.572, 1.204, 0.868, 0.57, 1.456, 0.586, 0.718, 0.966, 0.625, 0.951, 0.766, 1.458, 0.83, 1.25, 0.7, 1.334, 1.015, 1.43, 1.376, 0.942, 1.252, 1.441, 0.795, 1.25, 0.851, 1.383, 0.969, 0.629, 1.008, 0.729, 0.841, 0.619, 0.63, 1.189, 0.514, 0.899, 0.807, 0.63, 1.101, 0.528, 1.385, 0.838, 0.538, 1.364, 0.702, 1.129, 0.639, 0.557, 1.28, 0.664, 1.021, 1.43, 0.792, 1.229, 0.837, 1.183, 0.54, 0.831, 1.279, 1.385, 1.377, 0.827, 1.32, 0.537, 1.19, 1.446, 1.222, 0.762, 1.302, 0.626, 1.352, 1.316, 1.286, 1.239, 1.027, 1.198, 0.961, 0.515, 0.989, 0.979, 1.123, 0.889, 1.484, 0.734, 0.718, 0.758, 0.782, 1.163, 0.579, 0.744, 0.711, 1.13, 0.598, 0.913, 1.305, 0.684, 1.108, 1.373, 0.945, 0.837, 1.129, 1.005, 1.447, 1.393, 1.493, 1.262, 0.73, 1.232, 0.838, 1.319, 0.971, 1.234, 0.738, 1.418, 1.397, 0.927, 1.309, 0.784, 1.232, 1.454, 1.387, 0.851, 1.132, 0.958, 1.467, 1.41, 1.359, 0.529, 1.139, 1.438, 0.672, 0.756, 1.356, 0.736, 1.436, 1.414, 0.921, 0.669, 1.21, 1.041, 0.597, 0.541, 1.162, 1.292, 0.538, 1.011, 0.828, 1.356, 0.897, 0.831, 1.018, 1.412, 1.363, 1.371, 1.231, 1.278, 0.564, 1.134, 1.324, 0.593, 1.307, 0.66, 1.376, 1.469, 1.315, 0.959, 1.099, 1.313, 1.032, 1.128, 1.175, 0.64, 0.581, 1.09, 0.934, 0.698, 1.272]

I can make a histogram plot from it with

from matplotlib.pyplot import hist, show
hist(my_list, bins=20, range=[0.5,1.5])
show()

which produces a histogram of my_list with 20 bins over [0.5, 1.5] (image not included).

Now I want to create another list of random numbers (let's say 100 of them) that follows the same distribution as the old list (my_list), so that plotting a histogram of the new list produces essentially the same histogram. (I'm not sure how to link a discrete set to a continuous distribution!)

Is there any way to do this in Python 2.7? I appreciate any help in advance.

Alex Martelli · Accepted Answer

You first need to "bucket up" the range of interest. You can of course do that with tools from SciPy and the like, but for the sake of understanding what's going on, a small pure-Python version might help (no optimizations, for ease of understanding):

import collections

def buckets(discrete_set, amin=None, amax=None, bucket_size=None):
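    if amin is None: amin = min(discrete_set)
    if amax is None: amax = max(discrete_set)  # upper bound is the max, not the min
    if bucket_size is None: bucket_size = (amax - amin) / 20.0  # float division, also on Python 2
    # note: a sample exactly equal to amax may land in one extra bucket at the top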
    def to_bucket(sample):
        if not (amin <= sample <= amax): return None  # no bucket fits
        return int((sample - amin) // bucket_size)
    b = collections.Counter(to_bucket(s)
            for s in discrete_set if to_bucket(s) is not None)
    return amin, amax, bucket_size, b

So now you have a Counter (essentially a dict) mapping each bucket index (0, 1, 2, ...) to the number of samples observed in that bucket.
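To make the bucketing concrete, here is a minimal sketch on a made-up list. The sample values, the four-bucket split, and the clamp that folds a value equal to the upper bound into the last bucket are all my assumptions for illustration; they are not part of the buckets function above.

```python
import collections

# Hypothetical toy data, not the list from the question.
samples = [0.50, 0.52, 0.74, 0.99, 1.01, 1.26, 1.49, 1.50]
amin, amax, n_buckets = 0.5, 1.5, 4
bucket_size = (amax - amin) / n_buckets  # 0.25

def to_bucket(x):
    # Integer index of the bucket containing x; the min() clamp keeps
    # x == amax in the last bucket instead of opening a new one.
    return min(int((x - amin) // bucket_size), n_buckets - 1)

counts = collections.Counter(to_bucket(x) for x in samples)
print(sorted(counts.items()))  # [(0, 3), (1, 1), (2, 1), (3, 3)]
```

Bucket 0 covers [0.5, 0.75), bucket 1 covers [0.75, 1.0), and so on, so the counts are just a histogram stored as a dict.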

Next, you'll want to generate a random sample matching the bucket distribution measured by calling buckets(discrete_set), where discrete_set is your list (my_list in the question). A Counter's elements method can help, but you need a list for random.choice:

mi, ma, bs, bks = buckets(discrete_set)
buckelems = list(bks.elements())

(this may waste a lot of space, but you can optimize it later, separately from this understanding-focused overview:-).
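As a quick illustration of what elements() yields, here is a tiny sketch with made-up counts (the Counter contents are hypothetical): each bucket index is repeated as many times as its count, which is why a uniform random.choice over the resulting list picks buckets in proportion to their counts.

```python
import collections

# Hypothetical counts: bucket 0 was seen twice, bucket 3 once.
bks = collections.Counter({0: 2, 3: 1})

# elements() repeats each key count-many times; sorted() turns it into a list.
buckelems = sorted(bks.elements())
print(buckelems)  # [0, 0, 3]
```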

Now it's easy to get an N-sized sample, e.g.:

import random

def makesample(N, buckelems, mi, ma, bs):
    s = []
    for _ in range(N):
        buck = random.choice(buckelems)
        x = random.uniform(mi+buck*bs, mi+(buck+1)*bs)
        s.append(x)
    return s

Here I'm assuming the buckets are fine-grained enough that it's OK to use a uniform distribution within each bucket.

Now, optimizing this is of course interesting -- buckelems will have as many items as originally were in discrete_set, and if that imposes an excessive load on memory, cumulative distributions can be built and used instead.
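One possible shape for the cumulative-distribution variant is sketched below. The function names (make_cumulative, pick_bucket, sample_from_counts) are hypothetical, and using bisect over running totals is my choice of technique rather than something the answer prescribes. Memory use is proportional to the number of buckets instead of the number of original samples.

```python
import bisect
import random

def make_cumulative(bucket_counts):
    """Return (bucket_ids, running_totals) from a {bucket: count} mapping."""
    ids = sorted(bucket_counts)
    totals = []
    running = 0
    for b in ids:
        running += bucket_counts[b]
        totals.append(running)
    return ids, totals

def pick_bucket(ids, totals, rng=random):
    # Draw an integer in [0, total) and locate the first running total above it,
    # so each bucket is chosen with probability count / total.
    r = rng.randrange(totals[-1])
    return ids[bisect.bisect_right(totals, r)]

def sample_from_counts(n, bucket_counts, lo, bucket_size, rng=random):
    # Same idea as makesample: pick a bucket, then a uniform value inside it.
    ids, totals = make_cumulative(bucket_counts)
    out = []
    for _ in range(n):
        b = pick_bucket(ids, totals, rng)
        out.append(rng.uniform(lo + b * bucket_size, lo + (b + 1) * bucket_size))
    return out
```

With the names from the snippets above, a call would look like sample_from_counts(100, bks, mi, bs).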

Or, one could bypass the Counter altogether, and just "round" each item in the discrete set to its bucket's lower bound, if memory's OK but one wants more speed. Or, one could leave discrete_set alone and random.choice within it before "perturbing" the chosen value (in different ways depending on the constraints of the exact problem). No end of fun...!-)
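The last idea (pick an original value, then perturb it) could look roughly like the sketch below. The function name resample_with_jitter, the uniform noise, and the optional clamping to [lo, hi] are all assumptions on my part; the right perturbation depends on the constraints of your problem.

```python
import random

def resample_with_jitter(data, n, jitter, lo=None, hi=None, rng=random):
    """Draw n values from data, each shifted by uniform noise in [-jitter, +jitter]."""
    out = []
    for _ in range(n):
        x = rng.choice(data) + rng.uniform(-jitter, jitter)
        # Optional clamp keeps values inside a known range, at the cost of
        # piling a little extra mass onto the boundaries.
        if lo is not None:
            x = max(lo, x)
        if hi is not None:
            x = min(hi, x)
        out.append(x)
    return out
```

For the list in the question, something like resample_with_jitter(my_list, 100, 0.025, lo=0.5, hi=1.5) would give 100 values; the jitter of 0.025 (half a bin width of the 20-bin histogram) is just an example value.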
