Reputation: 255
I have a list of items from which I want to randomly sample a subset. Each item is paired with a histogram over D bins, and I want to sample the items in such a way that the summed histogram over the sample is approximately uniform.
Thus it should work as the sample function below:
>>> import numpy
>>> #The histograms from which to sample (each having 5 bins):
>>> data = numpy.random.randint(100, size=(10000,5))
>>> #The function which I'm trying to program:
>>> samples = sample(data,500)
>>> samples.shape
(500, 5)
>>> summed_histogram = samples.sum(axis=0)
>>> #Each bin should have approximately equal value
>>> summed_histogram / float(summed_histogram.sum())
array([ 0.2, 0.2, 0.2, 0.2, 0.2])
The absolute values of the summed histogram are not important nor does it need to be exactly uniform, it just needs to be approximately uniform. Also, I don't care if the returned sample size is not exactly the specified sample size. The sampling should be without replacement.
Upvotes: 3
Views: 1370
Reputation: 36920
To expand on @Ilmari Karonen's solution, what you want to do is compute weights for each histogram and then sample according to those weights. It appears to me that the most efficient way to do this, given your goal, would be with a linear program.
Let D_ij be the value of the jth bin in the histogram of the ith item. If each item is given weight w_i, then the jth bin of the "summed histogram" has value sum(i in items) w_i D_ij. One way to get your "approximately uniform" distribution is to minimize the maximum difference between bins, so we would solve the following LP:
minimize z
subject to (for all j, k)
z >= (sum i in items) w_i D_ij - (sum i in items) w_i D_ik
z >= (sum i in items) w_i D_ik - (sum i in items) w_i D_ij
The above is basically saying that z is at least as large as the absolute difference between every pair of weighted bin sums. To solve this LP you will need a separate package, since numpy does not include an LP solver. See this gist for a solution using cplex, or this gist for a solution using cvxpy. Note that you will need to set some constraints on the weights (e.g. each weight greater than or equal to 0), as those solutions do. Other Python bindings for GLPK (GNU Linear Programming Kit) can be found here: http://en.wikibooks.org/wiki/GLPK/Python.
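For illustration, here is a minimal sketch of the LP above using cvxpy. The function name solve_weights and the normalization constraint sum(w) == 1 (added to rule out the trivial all-zero solution) are my own choices, not taken from the linked gists:

import cvxpy as cp
import numpy

def solve_weights(data):
    # data: (n, d) array, data[i, j] = D_ij
    n, d = data.shape
    w = cp.Variable(n)       # w_i: weight of histogram i
    z = cp.Variable()        # z: max pairwise difference between weighted bin sums
    bins = data.T @ w        # weighted summed histogram, one entry per bin
    constraints = [w >= 0, cp.sum(w) == 1]   # non-negativity plus an assumed normalization
    for j in range(d):
        for k in range(j + 1, d):
            constraints += [z >= bins[j] - bins[k], z >= bins[k] - bins[j]]
    cp.Problem(cp.Minimize(z), constraints).solve()
    return numpy.asarray(w.value).ravel()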
Finally, you just sample from histogram i with weight w_i. This can be done with an adaptation of roulette selection using cumsum and searchsorted, as suggested by @Ilmari Karonen; see this gist.
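A rough sketch of that roulette-style draw (draw_sample is my name for it; the de-duplication step is one way to approximate sampling without replacement, which is fine here since the exact sample size doesn't matter):

import numpy

def draw_sample(data, weights, n_samples):
    cdf = numpy.cumsum(weights.astype(float))
    cdf /= cdf[-1]                                    # normalise to a proper CDF
    picks = numpy.searchsorted(cdf, numpy.random.random(n_samples))
    picks = numpy.unique(picks)                       # drop repeats; the sample may shrink a bit
    return data[picks]

The whole pipeline would then be roughly samples = draw_sample(data, solve_weights(data), 500).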
If you wanted the resulting weighted distribution to be "as uniform as possible", I would solve a similar problem with weights, but maximize the entropy of the weighted sum of bins. That problem is nonlinear, although you could use any number of nonlinear solvers such as BFGS or other gradient-based methods. This would probably be a bit slower than the LP method, but it depends on what you need in your application. The LP method would approximate the nonlinear method very closely if you have a large number of histograms, because it would then be easy to reach a uniform distribution.
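As a rough illustration of that nonlinear variant (the setup, bounds and function name here are my own sketch, not a tested formulation):

import numpy
from scipy.optimize import minimize

def entropy_weights(data):
    n, d = data.shape
    def neg_entropy(w):
        bins = w @ data                       # weighted summed histogram
        p = bins / bins.sum()                 # normalise to a distribution over bins
        return numpy.sum(p * numpy.log(p + 1e-12))   # negative entropy, to be minimised
    w0 = numpy.full(n, 1.0 / n)               # start from uniform weights
    res = minimize(neg_entropy, w0, method="L-BFGS-B",
                   bounds=[(0.0, 1.0)] * n)   # bounds enforce w_i >= 0
    return res.x / res.x.sum()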
When using the LP solution, a bunch of the histogram weights may bind to 0, because the number of nonzero weights in an LP vertex solution is bounded by the number of constraints, which is small when there are few bins. This will not be a problem with a non-trivial number of bins, since the number of constraints grows as O(n^2) in the number of bins.
Example weights with 50 histograms and 10 bins:
[0.006123642775837011, 0.08591660144140816, 0.0, 0.0, 0.0, 0.0, 0.03407525280610657, 0.0, 0.0, 0.0, 0.07092537493489116, 0.0, 0.0, 0.023926802333318554, 0.0, 0.03941537854267549, 0.0, 0.0, 0.0, 0.0, 0.10937063438351756, 0.08715770469631079, 0.0, 0.05841899435928017, 0.016328676622408153, 0.002218517959171183, 0.0, 0.0, 0.0, 0.08186919626269101, 0.03173286609277701, 0.08737065271898292, 0.0, 0.0, 0.041505225727435785, 0.05033635148761689, 0.0, 0.09172214842175723, 0.027548495513552738, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0259929997624099, 0.0, 0.0, 0.028044483157851748, 0.0, 0.0, 0.0]
With 50 histograms, each with 50 bins, there are now very few zero values:
[0.0219136051655165, 0.0, 0.028325808078797768, 0.0, 0.040889043180965624, 0.04372501089775975, 0.0, 0.031032870504105477, 0.020745831040881676, 0.04794861828714149, 0.0, 0.03763592540998652, 0.0029093177405377577, 0.0034239051136138398, 0.0, 0.03079554151573207, 0.0, 0.04676278554085836, 0.0461258666541918, 9.639105313353352e-05, 0.0, 0.013649362063473166, 0.059168272186891635, 0.06703936360466661, 0.0, 0.0, 0.03175895249795131, 0.0, 0.0, 0.04376133487616099, 0.02406633433758186, 0.009724226721798858, 0.05058252335384487, 0.0, 0.0393763638188805, 0.05287112817101315, 0.0, 0.0, 0.06365320629437914, 0.0, 0.024978299494456246, 0.023531082497830605, 0.033406648550332804, 0.012693750980220679, 0.00274892002684083, 0.0, 0.0, 0.0, 0.0, 0.04465971034045478, 4.888224154453002]
Upvotes: 3
Reputation: 5812
Could you draw a number of complete random samples (of 500), and then pick the one that is most uniform (i.e. the one with the lowest sample.sum(axis=0).std())? That avoids weird biases when drawing incremental samples.
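Something like this quick sketch (the number of trials is arbitrary, and sample_best_of is just a name I made up):

import numpy

def sample_best_of(data, n_samples, n_trials=100):
    best, best_std = None, numpy.inf
    for _ in range(n_trials):
        idx = numpy.random.choice(len(data), size=n_samples, replace=False)
        candidate = data[idx]
        std = candidate.sum(axis=0).std()     # spread of the summed histogram across bins
        if std < best_std:
            best, best_std = candidate, std
    return best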
Upvotes: 0