Reputation: 47
I have a 2d array, say, a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], ... [21, 22, 23, 24]], and I would like to pick N elements from each row randomly, based on a probability distribution p which can be different for each row.
So basically, I'd like to do something like [np.random.choice(a[i], N, p=p_arr[i]) for i in range(a.shape[0])] without using a loop, where p_arr is a 2d array of the same shape as a that stores the probability distribution for each row of a.
The reason I want to avoid using a for loop is because running a line profiler shows that the loop is slowing down my code by a lot (I have large arrays to work with).
Is there a more Pythonic way of doing this?
I checked out these links (here and here) but they don't answer my question.
Thank you!
An example of what I'd like to do without the loop:
>>> a = np.ones([500, 500])
>>> p_arr = np.identity(a.shape[0])
>>> for i in range(a.shape[0]):
... a[i] = a[i]*np.arange(a.shape[0])
...
>>> [print(np.random.choice(a[i], p =p_arr[i])) for i in range(a.shape[0])]
Upvotes: 2
Views: 1508
Reputation: 42143
Perhaps using a list comprehension instead of a loop would address the issue:
import numpy as np
shape = (10,10)
N = 4
distributions = np.random.rand(*shape)
distributions = distributions/(np.sum(distributions,axis=1)[:,None])
values = np.arange(shape[0]*shape[1]).reshape(shape)
sample = np.array([np.random.choice(v,N,p=r) for v,r in zip(values,distributions)])
output:
print(np.round(distributions,2))
[[0.03 0.22 0.1 0.09 0.2 0.1 0.11 0.05 0.08 0.01]
[0.04 0.12 0.13 0.03 0.16 0.22 0.16 0.05 0. 0.09]
[0.15 0.04 0.08 0.07 0.17 0.13 0.01 0.15 0.1 0.1 ]
[0.06 0.13 0.16 0.03 0.17 0.09 0.08 0.11 0.05 0.12]
[0.07 0.08 0.09 0.08 0.13 0.18 0.12 0.13 0.07 0.07]
[0.1 0.04 0.11 0.06 0.04 0.16 0.18 0.15 0.01 0.15]
[0.06 0.09 0.17 0.08 0.14 0.15 0.09 0.01 0.06 0.15]
[0.03 0.1 0.11 0.07 0.14 0.14 0.15 0.1 0.04 0.11]
[0.05 0.1 0.18 0.1 0.03 0.18 0.12 0.05 0.05 0.13]
[0.13 0.1 0.08 0.11 0.06 0.14 0.11 0. 0.14 0.14]]
print(sample)
[[ 6 4 8 5]
[16 19 15 10]
[25 20 24 23]
[37 34 30 31]
[41 44 46 45]
[59 55 53 57]
[64 63 65 61]
[79 75 76 77]
[85 81 83 88]
[99 96 93 90]]
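The list comprehension above still executes Python-level code once per row. As a sketch (not part of the original answer), the with-replacement case can be vectorized entirely with an inverse-CDF lookup: cumulatively sum each row's probabilities, then find, for each uniform draw, the first column whose cumulative value exceeds it:

```python
import numpy as np

rng = np.random.default_rng(0)

shape = (10, 10)
N = 4
values = np.arange(shape[0] * shape[1]).reshape(shape)
distributions = rng.random(shape)
distributions /= distributions.sum(axis=1, keepdims=True)

# Row-wise cumulative distributions and one uniform draw per sample.
cdf = np.cumsum(distributions, axis=1)
u = rng.random((shape[0], N))

# Broadcasting builds a (rows, N, cols) boolean array; argmax along the
# last axis picks the first column whose cumulative value exceeds the draw.
cols = (u[:, :, None] < cdf[:, None, :]).argmax(axis=-1)
sample = values[np.arange(shape[0])[:, None], cols]
```

This trades memory for speed (the intermediate boolean array is rows × N × cols), so it pays off when the per-row loop overhead dominates, not when the arrays are huge.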
If you want non-repeating samples in each line, there is another kind of optimization that you could try. By flattening the values and the distributions, you can create a non-repeating shuffle of indexes of the whole matrix according to the respective distributions of each line. With the flattened distributions, each group of values that belongs to the same line will have (as a group) an equivalent distribution. This means that, if you reassemble the shuffled indexes on their original lines while keeping their stable shuffled order, you can then take a slice of the shuffle matrix to obtain your sample:
flatDist = distributions.reshape((distributions.size,))
flatDist = flatDist/np.sum(flatDist)
randomIdx = np.random.choice(np.arange(values.size),flatDist.size,replace=False,p=flatDist)
shuffleIdx = np.array([randomIdx//shape[1],randomIdx%shape[1]])
shuffleIdx = shuffleIdx[:,np.argsort(shuffleIdx[0,:],kind="stable")]
sample = values[tuple(shuffleIdx)].reshape(shape)[:,:N]
output:
print(sample)
[[ 3 7 2 5]
[13 12 14 16]
[27 23 25 29]
[37 31 33 36]
[47 45 48 49]
[59 50 52 54]
[62 61 60 66]
[72 78 70 77]
[87 82 83 86]
[92 98 95 93]]
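For the without-replacement case, another vectorized option (a sketch, not claimed to reproduce the flattening approach exactly) is the Gumbel-top-k trick: add independent Gumbel noise to each row's log-probabilities and keep the N largest keys per row, which yields N distinct items per row with inclusion driven by the weights:

```python
import numpy as np

rng = np.random.default_rng(0)

shape = (10, 10)
N = 4
values = np.arange(shape[0] * shape[1]).reshape(shape)
distributions = rng.random(shape)
distributions /= distributions.sum(axis=1, keepdims=True)

# Gumbel-top-k: perturb log-probabilities with Gumbel(0, 1) noise and
# take the N largest keys in each row; the winners are distinct by
# construction, so this samples without replacement within each row.
gumbel = -np.log(-np.log(rng.random(shape)))
keys = np.log(distributions) + gumbel
cols = np.argsort(-keys, axis=1)[:, :N]
sample = values[np.arange(shape[0])[:, None], cols]
```

This is equivalent to sequentially sampling each row proportionally to the remaining weights, and it runs in a single pass of array operations with no Python-level loop.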
Upvotes: 1
Reputation: 123
The following can be used instead of the loop:
a = np.ones([500, 500])
p_arr = np.identity(a.shape[0])
a2 = a.flatten()
a3 = a2*np.full(shape=a.shape, fill_value=np.arange(a.shape[0])).flatten()
p_arr3 = p_arr.flatten()/a.shape[1]
print(np.random.choice(a3, a.shape[1], p =p_arr3))
I had to use np.ndarray.flatten()
multiple times to convert the 2D arrays into 1D arrays. Then we can avoid the loop by performing our operations on the 1D arrays.
Upvotes: 0