DGoiko

Reputation: 378

Efficient random sampling

My problem is simple: I have an array of 20 million floats. Every float in that array has a probability p of being randomly altered.

The simplest way to do so is to walk through the array, doing if (rand(0,1) < p) then modify.
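In Python, that naive pass looks roughly like this (a sketch; the update rule is a placeholder):

import random

# one uniform draw per element; modify on success
for i in range(len(ary)):
    if random.random() < p:
        ary[i] = ...  # modify ary[i] here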

However, even parallelizing, it's slow as hell, and I was wondering if there's a faster way to randomly obtain some indexes to modify.

My first thought was to pick p * n random indexes, where n is the total number of floats in the array. However, that doesn't exactly reproduce the probability distribution: nothing in the per-element scheme guarantees that exactly p * n floats will be modified.

Ideas?

PS: I'm using Python for the implementation. Probably someone has had this problem before and implemented something in the libraries, but I cannot find it.

Upvotes: 1

Views: 1785

Answers (5)

rici

Reputation: 241671

If p is small, you can save a lot of time by using numpy.random.geometric to provide a sample of distances between elements being altered.

A simple pass over the array ary:

from numpy.random import geometric

index = -1
while True:
  index += geometric(p)  # p is the per-element probability, e.g. 0.01
  if index >= len(ary):
    break
  ary[index] = ...  # compute new value

Numpy distribution functions can produce an array of return values, so as long as p is small, it's probably even faster to produce all the step values at once:

from numpy import cumsum
from numpy.random import geometric

# subtract 1 so the indexes are 0-based, matching the loop above
for index in cumsum(geometric(p, size=int(len(ary) * p * 1.1))) - 1:
  if index < len(ary):
    ary[index] = ...  # compute new value

The 1.1 is a fudge factor to make sure enough samples are drawn from the geometric distribution. For large arrays it will almost always be enough, but it is not guaranteed. A better (albeit more complicated) solution is to generate the samples in chunks of, say, 10000 and keep doing that until you reach the end of the array, as in the sketch below.
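A minimal sketch of that chunked approach (the function name and chunk size are illustrative, not from the answer):

import numpy as np

def geometric_indexes(n, p, chunk=10000):
  # Yield the 0-based indexes of the elements to alter, in increasing order.
  index = -1
  while True:
    # absolute positions implied by the next chunk of geometric steps
    for index in index + np.cumsum(np.random.geometric(p, size=chunk)):
      if index >= n:
        return
      yield index

# usage:
# for i in geometric_indexes(len(ary), p):
#   ary[i] = ...  # compute new value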

Upvotes: 0

Xavier Yin

Reputation: 56

This runs at roughly 4 seconds per round on my machine:

import random

rand = random.random
p = 0.1
TOTAL_ROUND = 10

# uniform floats in [0, 1), like the 20M-element array in the question
x = [rand() for i in range(20000000)]

for i in range(TOTAL_ROUND):
    print("round", i)
    # note: testing val < p only matches "alter with probability p" because
    # the values themselves are uniform in [0, 1); in general use rand() < p
    x = [rand() if val < p else val for val in x]

Upvotes: 0

CodeZero

Reputation: 1689

You could generate a random array with values in [0,1), of the same size n as your data vector, using

import numpy as np

rnd = np.random.rand(n)

Now you check at which indices these random values are smaller than p:

mask = rnd < p

and then change the data at all indices selected by the mask, e.g.:

data[mask] = np.random.rand(data[mask].size)

or use any other method you want to alter the data.
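Putting the steps together (a sketch; n, p, and the replacement values are illustrative):

import numpy as np

n = 20 * 10**6
p = 0.05
data = np.random.random(n)

rnd = np.random.rand(n)  # one uniform draw per element
mask = rnd < p           # True where the element should be altered
data[mask] = np.random.rand(np.count_nonzero(mask))  # replace the selected values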

Upvotes: 1

DYZ

Reputation: 57033

Your array:

import numpy as np

array = np.random.random(size=100)  # whatever

An array of random 0/1's:

p = 0.05 # Could be an array itself
markers = np.random.binomial(1, p, array.shape[0])

An array of indexes of the values to be modified:

locations = np.where(markers)[0]
# Something like array([19, 29, 32, 67, 68, 71])

You can loop through the original array using those indexes, or modify all the values at once with something like array[locations] = ..., as shown below.
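For example, with a placeholder update of fresh random values:

array[locations] = np.random.random(locations.size)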

Upvotes: 0

DSM

Reputation: 352989

First, if p is high, i.e. >= 0.5, you won't save much time whatever you do, because you're still likely to visit most of the elements. If p is lower, though, you can draw from a binomial distribution with n = 20M and your probability p to determine how many elements to touch.

In [23]: np.random.binomial(20*10**6, 0.1)
Out[23]: 1999582

In [24]: np.random.binomial(20*10**6, 0.99999)
Out[24]: 19999801

In [25]: np.random.binomial(20*10**6, 0.5)
Out[25]: 10001202

In [26]: np.random.binomial(20*10**6, 0.0001)
Out[26]: 1986
[...]
In [30]: np.random.binomial(20*10**6, 0.0001)
Out[30]: 1989

In [31]: np.random.binomial(20*10**6, 0.0001)
Out[31]: 1988

This number is the number of successes in n trials, each with probability p of success, which is exactly your situation.
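The count tells you how many elements to touch; you still have to pick which ones. A minimal sketch, assuming distinct positions drawn with np.random.choice (this step is not part of the answer):

import numpy as np

p = 0.0001
ary = np.random.random(20 * 10**6)

k = np.random.binomial(len(ary), p)                      # how many elements to alter
idx = np.random.choice(len(ary), size=k, replace=False)  # which distinct positions
ary[idx] = np.random.random(k)                           # or any other update rule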

Upvotes: 4
