Reputation: 459
I have some Python code with the following elements:
I have an intensities vector, which looks something like this:
array([ 1142., 1192., 1048., ..., 29., 18., 35.])
I also have an x vector, which looks like this:
array([ 0, 1, 1, ..., 1060, 1060, 1061])
Then I have a for loop in which I fill another vector, radialDistribution, like this:
for i in range(1000):
    radialDistribution[i] = sum(intensities[np.where(x == i)]) / len(np.where(x == i)[0])
The problem is that it takes about 20 seconds to complete, so I want to vectorize it. But I am quite new to broadcasting in NumPy and haven't found much material on it, so I need your help.
I tried this, but it didn't work:
i= np.ogrid[:1000]
intensities[i] = sum(sortedIntensities1D[np.where(sortedDists1D == i)]) / len(np.where(sortedDists1D == i)[0])
Could you help me by telling me where I should look to learn about vectorization in NumPy?
Thanks in advance for your valuable help!
Upvotes: 3
Views: 1019
Reputation: 231335
Building on my itertools.groupby solution in https://stackoverflow.com/a/22265803/901925, here's a solution that works on two small arrays.
import numpy as np
import itertools
intensities = np.arange(12, dtype=float)
x = np.array([1,0,1,2,2,1,0,0,1,2,1,0])   # general: not sorted or consecutive
First, a bincount solution, adjusted for non-consecutive values:
# using bincount
# if the values in 'x' are not consecutive
J = np.bincount(x) > 0
print(np.bincount(x, weights=intensities)[J] / np.bincount(x)[J])
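Applied to your own variables, the same idea would look roughly like this (just a sketch; note that the masked result only contains the x values that actually occur, so its length may differ from your 1000-element radialDistribution):
counts = np.bincount(x)                      # number of samples at each x value
sums = np.bincount(x, weights=intensities)   # summed intensity at each x value
J = counts > 0                               # mask out x values that never occur
radialDistribution = sums[J] / counts[J]     # mean intensity per occurring x value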
Now a groupby solution:
# using groupby;
# sort first if needed
I = np.argsort(x)
x = x[I]
intensities = intensities[I]
# make a record array for use by groupby
xi = np.zeros(shape=x.shape, dtype=[('intensities', float), ('x', int)])
xi['intensities'] = intensities
xi['x'] = x
g = itertools.groupby(xi, lambda z: z['x'])
xx = np.array([np.array([z[0] for z in y[1]]).mean() for y in g])
print(xx)
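As a quick sanity check, the groupby result should agree with the bincount result from above (a sketch, reusing the J mask defined earlier):
print(np.allclose(np.bincount(x, weights=intensities)[J] / np.bincount(x)[J], xx))   # True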
Here's a compact numpy solution, using the return_index option of np.unique and np.split. x should be sorted. I'm not optimistic about the speed for large arrays, since there will be iteration in unique and split, in addition to the comprehension.
[values, index] = np.unique(x, return_index=True)
[y.mean() for y in np.split(intensities, index[1:])]
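To keep the distinct x values paired with their means, the same thing can be wrapped up like this (a sketch):
values, index = np.unique(x, return_index=True)
means = np.array([y.mean() for y in np.split(intensities, index[1:])])
# values[k] is the x value whose mean intensity is means[k]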
Upvotes: 0
Reputation: 10759
Here is my implementation of group_by functionality in numpy. It is conceptually similar to the pandas solution, except that it does not require pandas, and it ought to become part of the numpy core, in my opinion.
Using this functionality, your code would look like this:
radialDistribution = group_by(x).mean(intensities)
and it would complete in no time.
Also have a look at the test_radial function defined at the end, which may come even closer to your end goal.
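The implementation itself is not reproduced in this post, but as a rough sketch of the interface (an illustration built on np.argsort and np.add.reduceat, not the actual module referred to above):
import numpy as np

class group_by(object):
    # Illustrative sketch only; not the full implementation this answer refers to.
    def __init__(self, keys):
        self.sorter = np.argsort(keys)
        sorted_keys = keys[self.sorter]
        # True at the start of each run of equal keys
        flag = np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1]))
        self.unique = sorted_keys[flag]           # the distinct key values
        self.start = np.flatnonzero(flag)         # where each group begins
        self.count = np.diff(np.append(self.start, len(sorted_keys)))

    def mean(self, values):
        # per-group sums via reduceat, divided by the group sizes
        sums = np.add.reduceat(values[self.sorter], self.start)
        return sums / self.count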
Upvotes: 2
Reputation:
Here's a method that uses broadcasting:
# arrays need to be at least 2D for broadcasting
x = np.atleast_2d(x)
# create vector of indices
i = np.atleast_2d(np.arange(x.size))
# do the vectorized calculation
bool_eq = (x == i.T)
totals = np.sum(np.where(bool_eq, intensities, 0), axis=1)
rD = totals / np.sum(bool_eq, axis=1)
This uses broadcasting two times: in the operation x == i.T and in the call to np.where. Unfortunately the code above is very slow, even slower than the original. The main bottleneck here is np.where, which we can speed up in this case by taking the product of the Boolean array and the intensities (also by broadcasting):
totals = np.sum(bool_eq*intensities, axis=1)
And this is essentially the same as a matrix-vector product, so we can write:
totals = np.dot(intensities, bool_eq.T)
The end result is faster code than the original (at least until memory use for the intermediate array becomes the limiting factor), but you're probably better off with an iterative approach, as suggested by one of the other answers.
Edit: making use of np.einsum was faster still (in my trial):
totals = np.einsum('ij,j', bool_eq, intensities)
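If you want to check which variant is fastest on your own machine, a small timing sketch along these lines could be used (the array sizes here are placeholders):
import timeit

setup = """
import numpy as np
x = np.repeat(np.arange(1000), 10)        # placeholder data
intensities = np.random.rand(x.size)
i = np.atleast_2d(np.arange(1000))
bool_eq = (np.atleast_2d(x) == i.T)
"""
variants = ["np.sum(np.where(bool_eq, intensities, 0), axis=1)",
            "np.sum(bool_eq * intensities, axis=1)",
            "np.dot(intensities, bool_eq.T)",
            "np.einsum('ij,j', bool_eq, intensities)"]
for stmt in variants:
    print(stmt, timeit.timeit(stmt, setup=setup, number=10))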
Upvotes: 1
Reputation: 67427
If your x vector has consecutive integers starting at 0, then you can simply do:
radialDistribution = np.bincount(x, weights=intensities) / np.bincount(x)
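If some integers between 0 and x.max() never occur in x, the second bincount contains zeros and the division produces warnings and NaNs for those bins. One way to guard against that (a sketch) is:
counts = np.bincount(x)
sums = np.bincount(x, weights=intensities)
radialDistribution = np.full(counts.shape, np.nan)   # NaN for empty bins
nonzero = counts > 0
radialDistribution[nonzero] = sums[nonzero] / counts[nonzero]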
Upvotes: 5