Reputation: 378
My problem is simple: I have an array with 20 million floats. In that array, every float has a probability p of being randomly altered.
The simplest way to do this is to walk through the array, doing if (rand(0,1) < p) then modify.
However, even parallelized, it's slow as hell, and I was wondering if there's a faster way to randomly obtain the indexes to modify.
My first thought was to pick p * n random indexes, where n is the total number of floats in the array. However, that doesn't exactly reproduce the probability distribution, since in the per-element approach nothing guarantees that exactly p*n floats will be modified.
Ideas?
PS: I'm using Python for the implementation. Someone has probably had this problem before and implemented something in the libraries, but I cannot find it.
Upvotes: 1
Views: 1785
Reputation: 241671
If p is small, you can save a lot of time by using numpy.random.geometric
to provide a sample of distances between elements being altered.
A simple pass over the array ary:
from numpy.random import geometric

p = 0.01  # example of a small per-element probability
index = -1
while True:
    index += geometric(p)  # gap to the next altered element (always >= 1)
    if index >= len(ary):
        break
    ary[index] = ...  # compute new value
NumPy's distribution functions can return a whole array of samples at once, so as long as p is small, it's probably even faster to produce all the step sizes in one call:
from numpy import cumsum
from numpy.random import geometric

# subtract 1 so the indexing matches the loop above (index 0 can be chosen)
for index in cumsum(geometric(p, size=int(len(ary) * p * 1.1))) - 1:
    if index < len(ary):
        ary[index] = ...  # compute new value
The 1.1 is a fudge factor to make sure enough step values are drawn from the geometric distribution. For large arrays it will almost always suffice, but it is not guaranteed. A better (albeit more complicated) solution would be to generate the samples in chunks of, say, 10000 and keep going until you reach the end of the array, as sketched below.
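Here is a minimal sketch of that chunked variant (the function name, the chunk size of 10000, and the replacement value are illustrative assumptions, not part of the original answer):

import numpy as np

def alter_in_chunks(ary, p, chunk=10000):
    # Draw geometric gaps in fixed-size batches until the running
    # index walks past the end of the array; no fudge factor needed.
    pos = -1
    while True:
        gaps = np.random.geometric(p, size=chunk)  # each gap is >= 1
        for pos in pos + np.cumsum(gaps):
            if pos >= len(ary):
                return
            ary[pos] = np.random.random()  # compute new value here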
Upvotes: 0
Reputation: 56
This runs in about 4 seconds per round on my machine:
import random

rand = random.random
p = 0.1
TOTAL_ROUND = 10
x = [rand() for _ in range(20000000)]
for i in range(TOTAL_ROUND):
    print("round", i)
    # each element is replaced with probability p, using a fresh random draw
    x = [rand() if rand() < p else val for val in x]
Upvotes: 0
Reputation: 1689
You could generate a random array with values in [0,1) of the same size n as your data vector:
import numpy as np
rnd = np.random.rand(n)
Now you check at which indices these random values are smaller than p
mask = rnd < p
and then change the data at all indices selected by the mask, e.g.:
data[mask] = np.random.rand(data[mask].size)
or use any other method you want to alter the data.
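Put together, a minimal self-contained sketch might look like this (the array contents and the replacement rule are placeholders, not from the original answer):

import numpy as np

n = 20000000
p = 0.001                      # example probability
data = np.random.rand(n)       # stand-in for the real 20M floats

mask = np.random.rand(n) < p   # True with probability p for each element
data[mask] = np.random.rand(np.count_nonzero(mask))  # fresh value per selected slot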
Upvotes: 1
Reputation: 57033
Your array:
import numpy as np
array = np.random.random(size=100)  # whatever your data actually is
An array of random 0/1's:
p = 0.05 # Could be an array itself
markers = np.random.binomial(1, p, array.shape[0])
An array of indexes of the values to be modified:
locations = np.where(markers)[0]
# Something like array([19, 29, 32, 67, 68, 71])
You can loop through the original array using those indexes, or modify all the values at once with something like array[locations] = ..., as illustrated below.
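For instance, the vectorized form could be something like this (the replacement rule is only an illustration):

import numpy as np

# assuming `array` and `locations` from the snippets above
array[locations] = np.random.random(locations.size)  # one fresh value per marked index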
Upvotes: 0
Reputation: 352989
First, if p is high, i.e. >= 0.5, you won't save much time whatever you're doing because you're still likely to visit most of the elements. If p is lower, though, you can draw from a binomial distribution with n=20M and your probability to determine how many elements to touch.
In [23]: np.random.binomial(20*10**6, 0.1)
Out[23]: 1999582
In [24]: np.random.binomial(20*10**6, 0.99999)
Out[24]: 19999801
In [25]: np.random.binomial(20*10**6, 0.5)
Out[25]: 10001202
In [26]: np.random.binomial(20*10**6, 0.0001)
Out[26]: 1986
[...]
In [30]: np.random.binomial(20*10**6, 0.0001)
Out[30]: 1989
In [31]: np.random.binomial(20*10**6, 0.0001)
Out[31]: 1988
This draw gives the number of successes in n trials, each with probability p of success, which is exactly your situation.
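One way to turn that count into concrete positions (a sketch, not from the original answer) is to sample that many distinct indexes without replacement:

import numpy as np

n = 20 * 10**6
p = 0.0001
ary = np.random.random(n)        # stand-in for the real data

k = np.random.binomial(n, p)     # how many elements to alter
idx = np.random.choice(n, size=k, replace=False)  # which elements, all distinct
ary[idx] = np.random.random(k)   # compute the new values

Note that np.random.choice with replace=False permutes the full index range internally, so for n = 20M the mask-based or geometric-gap approaches in the other answers may still be faster in practice.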
Upvotes: 4