Fomite

Reputation: 2273

Vectorizing the addition of results to a numpy array

I have a function that works something like this:

def Function(x):
   a = random.random()
   b = random.random()
   c = OtherFunctionThatReturnsAThreeColumnArray()
   results = np.zeros((1,5))
   results[0,0] = a
   results[0,1] = b
   results[0,2] = c[-1,0]
   results[0,3] = c[-1,1]
   results[0,4] = c[-1,2]
   return results

What I'm trying to do is run this function many, many times, appending the returned one-row, five-column result to a running data set. As I understand it, both the append function and a for-loop are ruinously inefficient. I'd like to improve my code, and the number of runs will be large enough that this kind of inefficiency will hurt.

What's the best way to do the following with the least overhead:

  1. Create a new numpy array to hold the results
  2. Insert the results of N calls of that function into the array in 1?

Upvotes: 3

Views: 194

Answers (1)

danodonovan

Reputation: 20373

You're correct in thinking that numpy.append or numpy.concatenate will be expensive if repeated many times: each call allocates a new array and copies both existing arrays into it.

The best approach, if you know how much space you'll need in total, is to allocate the array before you run your routine and then write the results in place as they become available.

If you're going to run this nrows times, then

results = np.zeros([nrows, 5])

and then add your results

def function(x, i, results):
    <.. snip ..>
    results[i,0] = a
    results[i,1] = b
    results[i,2] = c[-1,0]
    results[i,3] = c[-1,1]
    results[i,4] = c[-1,2]
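As a minimal sketch of the preallocation approach, the following fills each row in place. The helper other_function is a hypothetical stand-in for OtherFunctionThatReturnsAThreeColumnArray, and inputs and nrows = 1000 are made up for illustration.

```python
import random

import numpy as np


def other_function():
    # Hypothetical stand-in for OtherFunctionThatReturnsAThreeColumnArray:
    # returns a fixed 4x3 array whose last row is [9., 10., 11.].
    return np.arange(12.0).reshape(4, 3)


def function(x, i, results):
    # Write one row of results in place; nothing is returned or appended.
    c = other_function()
    results[i, 0] = random.random()
    results[i, 1] = random.random()
    results[i, 2:5] = c[-1, :]  # last row of c fills columns 2..4


nrows = 1000                      # assumed to be known up front
inputs = range(nrows)             # hypothetical inputs
results = np.zeros((nrows, 5))    # allocated once, before the loop
for i, x in enumerate(inputs):
    function(x, i, results)
```

The loop itself remains, but the array is allocated only once, so no per-iteration copying takes place.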

Of course, if you don't know how many times you're going to run function, this won't work. In that case, I'd suggest a less elegant approach:

  1. Declare a possibly large results array and write to results[i, x] as above, keeping track of i and the size of results.

  2. When you reach the size of results, extend it with a single numpy.append (or concatenate) of a new block of rows. This is far less costly than appending on every call and shouldn't destroy performance, but you will have to write some wrapper code.
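The wrapper code for these two steps might look roughly like the sketch below. The class name GrowableRows, its methods, and the choice to double the capacity on each growth are assumptions for illustration, not part of numpy.

```python
import numpy as np


class GrowableRows:
    """Hypothetical wrapper: a preallocated buffer that grows in large blocks."""

    def __init__(self, ncols, capacity=1024):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._data = np.zeros((capacity, ncols))
        self._n = 0  # number of rows actually filled

    def add(self, row):
        if self._n == self._data.shape[0]:
            # Buffer is full: one concatenate doubles the capacity,
            # so copies happen rarely instead of on every call.
            self._data = np.concatenate([self._data, np.zeros_like(self._data)])
        self._data[self._n] = row
        self._n += 1

    def view(self):
        # Only the filled rows; the unused tail is hidden.
        return self._data[:self._n]


buf = GrowableRows(ncols=5, capacity=2)
for i in range(7):
    buf.add([i, i, i, i, i])
filled = buf.view()
```

Doubling is one possible growth policy; appending fixed-size blocks would also work, at the cost of more copies for very long runs.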

There are other ideas you could pursue. Off the top of my head you could

  1. Write the results to disk. Depending on the speed of OtherFunctionThatReturnsAThreeColumnArray and the size of your data, this may not be too daft an idea.

  2. Collect your results with a list comprehension (forgetting numpy until after the run). If function returned (a, b, c) rather than results:

results = [function(x) for x in my_data]

and now do some shuffling to get results into the form you need.
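One way the list-based idea could look is sketched here. As a variation on the above, function returns a flat five-element tuple rather than (a, b, c), so that a single np.array call does the shuffling. other_function and my_data are hypothetical stand-ins.

```python
import random

import numpy as np


def other_function():
    # Hypothetical stand-in for OtherFunctionThatReturnsAThreeColumnArray.
    return np.arange(12.0).reshape(4, 3)


def function(x):
    a = random.random()
    b = random.random()
    c = other_function()
    # Return plain values; the numpy array is built only once, at the end.
    return (a, b, c[-1, 0], c[-1, 1], c[-1, 2])


my_data = range(1000)                     # hypothetical inputs
rows = [function(x) for x in my_data]     # Python list of 5-tuples
results = np.array(rows)                  # one conversion to an (N, 5) array
```

Appending to a Python list is cheap, so this avoids repeated array copies without needing to know N in advance.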

Upvotes: 2
