Ruggero Turra

Reputation: 17740

mean of array of arrays

I have a structure like this:

a = np.array([[np.array([1,2,3]), np.array([2])], [np.array([0,1,2,3,4]), np.array([0,4])]])

It is a 2x2 structure where every element is an array of arbitrary length.

I want to get the mean of every element of a:

np.mean(a)

ValueError: operands could not be broadcast together with shapes (3) (5) 

I tried to play with axis, but it doesn't work. I want to get a new np.array equal to [[2, 2], [2, 2]].

In general I want to be able to run any vectorized function on a in the same way.

How can I do it? I need fast code, so please avoid explicit loops.

The best I can do is:

f = np.mean
result = np.zeros(a.shape)
for i in np.ndindex(a.shape):
  result[i] = f(a[i])

Upvotes: 1

Views: 1098

Answers (3)

abarnert

Reputation: 366133

From your comment: you have a relatively small number of quite big elements. That means the speed of the outer loop's iteration is irrelevant, while the speed of the inner loops' iterations is crucial.

Let's put some actual numbers on this. Your outer array has up to 4 dimensions of up to size 10, so there are up to 10*10*10*10 = 10000 elements. Meanwhile, the elements are "quite big"—let's interpret that conservatively as just 50 values each. That gives you 10000 outer iterations plus 500000 inner iterations, 510000 in total. Anything you do to improve the speed of the 10000 outer iterations will make less than a 2% difference in your code. In fact, it's far less than that—the 2% assumes there is no work to be done other than the iteration itself, which obviously isn't true.

So, you're focusing on the wrong place. Only the 500000 inner iterations matter. If you could replace the array of arrays with a single one-dimension-higher array and do it all in numpy, it might be faster, but making your code much more complex and hard to understand for a gain on the order of a fraction of 1% is silly. Just use the vectorize answer or comprehension answer, which are simple and obvious.


Meanwhile:

Probably I should try to parallelize the evaluation, using a thread for every matrix element.

Parallelism is a good idea here. But not using threads, and not one per element.

If you have, say, an 8-core machine, using 8 hardware threads means you get things done nearly 8 times as fast. But using 10000 hardware threads does not mean you get things done 10000 times as fast, or even 8 times as fast—you probably spend so much time context-switching, allocating and freeing thread stacks, etc. that it actually takes longer than the sequential version. So, you want to create a worker pool of 8 hardware threads, and stick your 10000 tasks in a queue to be run by that pool.

Also, 10000 is probably way too granular. Each job has a little bit of overhead, and if your jobs are too small, you're wasting too much time on overhead. The obvious way to batch things up is per axis—have 1000 jobs, each doing one row of 10 elements, or 100 jobs, each doing one 2D array of 100 elements. Test out batch sizes of 0, 1, 2, and 3 axes and see which gives the best performance. If the differences are huge, you may want to try slightly more complex batching, like splitting into 3D arrays of 3x10x10 and 4x10x10.

Finally, while Python threads are real OS (and therefore hardware) threads, the GIL prevents more than one of them from running Python code at a time. numpy does some work to get around this, but it's not perfect (especially if you have a lot of overhead between the numpy calls). The simplest way around this is to use multiprocessing, which lets you have a pool of 8 separate processes, instead of 8 threads in 1 process. This ramps up the overhead significantly—you either need to copy the sub-arrays (either implicitly, as the parameters and return values of your task functions, or explicitly, over a pipe) or put them in shared memory. But if your task sizes are large enough, it's usually not a problem.


Here's an example of how to parallelize your existing code:

from concurrent import futures
import numpy as np

f = np.mean
result = np.zeros(a.shape)
future_map = {}
with futures.ProcessPoolExecutor(max_workers=8) as executor:
    # Submit one task per element, remembering which index each future belongs to.
    for i in np.ndindex(a.shape):
        future_map[executor.submit(f, a[i])] = i
    # Collect the results as they finish and put them back in place.
    for future in futures.as_completed(future_map):
        i = future_map[future]
        result[i] = future.result()

Obviously, you could simplify it (e.g., replace the submission loop with a dict comprehension), but I wanted to make it as obvious as possible what you need to change.
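For example, the submission loop might collapse to a one-line dict comprehension (same behaviour, just terser):

future_map = {executor.submit(f, a[i]): i for i in np.ndindex(a.shape)}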

Also, I'm using concurrent.futures (which requires Python 3.2+; if you're on 2.7, install the futures backport from PyPI) because it makes the code a little simpler; if you need more flexibility, you want the more complex multiprocessing library.

And finally, I'm not doing any batching—each task is a single element—so the overhead is probably going to be pretty bad.

But start with this, simplify it as far as you're comfortable with, then convert it to use batches of 1, 2, and 3 axes, etc. as described earlier.
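As a rough sketch of what batching along the first axis might look like (mean_block is just an illustrative helper name, this assumes the 4D object array described above, and I'm omitting the usual if __name__ == '__main__' guard):

from concurrent import futures
import numpy as np

def mean_block(block):
    # One task: reduce every element of a whole sub-block sequentially.
    out = np.zeros(block.shape)
    for i in np.ndindex(block.shape):
        out[i] = np.mean(block[i])
    return out

result = np.zeros(a.shape)
with futures.ProcessPoolExecutor(max_workers=8) as executor:
    # One job per a[k]: a.shape[0] jobs instead of a.size jobs.
    block_futures = {executor.submit(mean_block, a[k]): k for k in range(a.shape[0])}
    for future in futures.as_completed(block_futures):
        result[block_futures[future]] = future.result()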

Upvotes: 2

askewchan

Reputation: 46578

This is the best I can do:

a = np.array([[np.array([1,2,3]), np.array([2])], [np.array([0,1,2,3,4]), np.array([0,4])]])
np.array([ np.mean(b) for b in a.ravel()]).reshape(a.shape)

outputs

array([[ 2.,  2.],
       [ 2.,  2.]])

Can be generalized for other functions as:

def ravel_func(func,arr):
    return np.asarray([ func(part) for part in arr.ravel()]).reshape(arr.shape)
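For instance, with the same a as above it works with any function that maps an array to a scalar (np.sum and np.amax here are just illustrations):

ravel_func(np.sum, a)   # -> [[ 6,  2], [10,  4]]
ravel_func(np.amax, a)  # -> [[3, 2], [4, 4]]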

Speed test, thanks to Jaime (vmean is the np.vectorize version from the answer below):

In [82]: timeit vmean(a)
10000 loops, best of 3: 65 us per loop

In [83]: timeit ravel_func(np.mean,a)
10000 loops, best of 3: 67 us per loop

Upvotes: 3

wim

Reputation: 363556

I guess you want numpy.vectorize, which is basically like the builtin map for ndarrays.

>>> a = np.array([[np.array([1,2,3]), np.array([2])], [np.array([0,1,2,3,4]), np.array([0,4])]])
>>> vmean = np.vectorize(np.mean)
>>> vmean(a)
array([[ 2.,  2.],
       [ 2.,  2.]])
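The same pattern should work for any other function that reduces an array to a scalar, e.g. (continuing with the a above; the exact dtype and formatting of the output may vary with your numpy version):

>>> vsum = np.vectorize(np.sum)
>>> vsum(a)
array([[ 6,  2],
       [10,  4]])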

Upvotes: 3
