GabeFS

Reputation: 76

Randomly grow values in a NumPy Array

I have a program that takes some large NumPy arrays and, based on some outside data, grows them by adding one to randomly selected cells until the array's sum is equal to the outside data. A simplified and smaller version looks like:

import numpy as np
# randint excludes its upper bound, so 101 gives values from 0 to 100
my_array = np.random.randint(0, 101, [100, 100])
## Just creating a sample version of the array, then getting its sum:
np.sum(my_array)
499097

So, supposing I want to grow the array until its sum is 1,000,000, and that I want to do so by repeatedly selecting a random cell and adding 1 to it until we hit that sum, I'm doing something like:

import random

diff = 1000000 - np.sum(my_array)
counter = 0
while counter < diff:
    row = random.randrange(0,99)
    col = random.randrange(0,99)
    coord = [row, col]
    my_array[coord] += 1
    counter += 1

Here row and col together identify a random cell in the array, and that cell is incremented by 1. The loop repeats until the number of increments equals the difference between the original array's sum and the target sum (1,000,000).

However, when I check the result after running this - the sum is always off. In this case after running it with the same numbers as above:

np.sum(my_array)
99667203

I can't figure out what is accounting for this massive difference. And is there a more pythonic way to go about this?

Upvotes: 2

Views: 905

Answers (3)

grovesNL

Reputation: 6075

my_array[coord] does not do what you expect. Because coord is a list, NumPy treats it as a list of row indices, so the statement selects entire rows and adds 1 to every entry in them. Index with my_array[row, col] instead.

The whole loop can then be written as:

for _ in range(1000000 - np.sum(my_array)):
    # randrange excludes its stop value, so 100 is needed to reach index 99
    my_array[random.randrange(0, 100), random.randrange(0, 100)] += 1

(or xrange instead of range if using Python 2.x)
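
To make the difference between the two indexing forms concrete, here is a minimal sketch on a small 4x4 array. The array size and the names a and b are illustrative assumptions, not part of the original code:

```python
import numpy as np

a = np.zeros((4, 4), dtype=int)
b = a.copy()

row, col = 1, 2

# A list index is treated as a list of row indices:
# this adds 1 to every entry of rows 1 and 2.
a[[row, col]] += 1

# A tuple index (row, col) addresses one cell: only (1, 2) changes.
b[row, col] += 1

list_sum = int(a.sum())    # 2 rows * 4 columns
tuple_sum = int(b.sum())   # a single cell
print(list_sum, tuple_sum)
```

With the list index the sum grows by the size of two full rows per step, while the tuple index grows it by exactly 1.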

Upvotes: 1

ali_m

Reputation: 74154

The problem with your original approach is that you are indexing your array with a list, which NumPy interprets as a sequence of indices into the row dimension (integer array indexing, also called advanced indexing), rather than as separate indices into the row and column dimensions. Try passing a tuple instead of a list:

coord = row, col
my_array[coord] += 1

A much faster approach would be to find the difference between the sum over the input array and the target value, then generate an array containing the same number of random indices into the array and increment them all in one go, thus avoiding looping in Python:

import numpy as np

def grow_to_target(A, target=1000000, inplace=False):

    if not inplace:
        A = A.copy()

    # how many times do we need to increment A?
    n = target - A.sum()

    # pick n random indices into the flattened array
    # (randint excludes its upper bound, so A.size is correct here)
    idx = np.random.randint(0, A.size, n)

    # how many times did we sample each unique index?
    uidx, counts = np.unique(idx, return_counts=True) 

    # increment the array counts times at each unique index
    A.flat[uidx] += counts

    return A

For example:

a = np.zeros((100, 100), dtype=int)

b = grow_to_target(a)
print(b.sum())
# 1000000

%timeit grow_to_target(a)
# 10 loops, best of 3: 91.5 ms per loop
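
The counting step matters because an in-place add through an index array with repeated entries applies each repeated index only once. As a hedged sketch of the same idea, np.bincount can do the counting in one call. The function name grow_with_bincount, the seeded generator, and the sample sizes below are assumptions for illustration:

```python
import numpy as np

# Seeded generator: an assumption made only so the sketch is repeatable.
rng = np.random.default_rng(0)

def grow_with_bincount(A, target):
    """Return a copy of A grown to `target` by random unit increments."""
    n = int(target - A.sum())
    if n < 0:
        raise ValueError("target is smaller than the current sum")
    # n random positions in the flattened array (upper bound excluded)
    idx = rng.integers(0, A.size, size=n)
    # number of times each flat position was drawn
    counts = np.bincount(idx, minlength=A.size)
    return A + counts.reshape(A.shape)

a = np.zeros((100, 100), dtype=int)
grown = grow_with_bincount(a, 1_000_000)

# Contrast: without counting, repeated indices are incremented only once.
naive = a.copy()
idx = rng.integers(0, naive.size, size=1000)
naive.flat[idx] += 1
```

Here grown.sum() reaches the target exactly, whereas naive.sum() equals only the number of distinct indices drawn.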

Upvotes: 0

Michael Bird

Reputation: 78

Replace my_array[coord] with my_array[row][col] (or, more idiomatically, my_array[row, col]). Your method chose two random integers and added 1 to every entry in the rows corresponding to both integers.

Basically you had a minor misunderstanding of how numpy indexes arrays.

Edit: To make this clearer. The code posted chose two numbers, say 30 and 45, and added 1 to all 100 entries of row 30 and all 100 entries of row 45.

From this you would expect the total sum to be 100,679,697 = 200*(1,000,000 - 499,097) + 499,097

However, when the two random integers are identical (say, 45 and 45), row 45 is incremented only once, not twice, so in that case the sum only jumps by 100.
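
The arithmetic and the repeated-index case can be checked with a short sketch. The variable names are illustrative assumptions; the numbers are the ones from the question:

```python
import numpy as np

start_sum = 499_097
target = 1_000_000
iterations = target - start_sum            # number of loop passes
expected = start_sum + 200 * iterations    # two full rows of 100 per pass

# Two distinct row indices: 200 cells are incremented.
a = np.zeros((100, 100), dtype=int)
a[[30, 45]] += 1
distinct_gain = int(a.sum())

# The same row index twice: the row is incremented only once.
b = np.zeros((100, 100), dtype=int)
b[[45, 45]] += 1
repeated_gain = int(b.sum())

print(expected, distinct_gain, repeated_gain)
```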

Upvotes: 0
