Reputation: 459
I have some Python code with the following elements:
I have an intensities vector, which looks something like this:
array([ 1142., 1192., 1048., ..., 29., 18., 35.])
I also have an x vector, which looks like this:
array([ 0, 1, 1, ..., 1060, 1060, 1061])
Then I have a for loop in which I fill another vector, radialDistribution, like this:
for i in range(1000):
    radialDistribution[i] = sum(intensities[np.where(x == i)]) / len(np.where(x == i)[0])
The problem is that it takes about 20 seconds to complete, so I want to vectorize it. But I am quite new to broadcasting in NumPy and haven't found much material on it, so I need your help.
I tried this, but it didn't work:
i= np.ogrid[:1000]
intensities[i] = sum(sortedIntensities1D[np.where(sortedDists1D == i)]) / len(np.where(sortedDists1D == i)[0])
Could you help me by telling me where I should look to learn about vectorization in NumPy?
Thanks in advance for your valuable help!
Upvotes: 3
Views: 1019
Reputation: 231335
Building on my itertools.groupby solution in https://stackoverflow.com/a/22265803/901925, here's a solution that works on two small arrays.
import numpy as np
import itertools
intensities = np.arange(12, dtype=float)
x = np.array([1,0,1,2,2,1,0,0,1,2,1,0])   # general: not sorted or consecutive
First, a bincount solution, adjusted for non-consecutive values:
# using bincount
# if the values in 'x' are not consecutive
J = np.bincount(x) > 0
print(np.bincount(x, weights=intensities)[J] / np.bincount(x)[J])
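Applied to your own variables, the same idea would look roughly like this (just a sketch; note that the masked result only contains the x values that actually occur, so its length may differ from your 1000-element radialDistribution):
counts = np.bincount(x)                      # number of samples at each x value
sums = np.bincount(x, weights=intensities)   # summed intensity at each x value
J = counts > 0                               # mask out x values that never occur
radialDistribution = sums[J] / counts[J]     # mean intensity per occurring x value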
Now a groupby solution:
# using groupby;
# sort first if needed
I = np.argsort(x)
x = x[I]
intensities = intensities[I]
# make a record array for use by groupby
xi = np.zeros(shape=x.shape, dtype=[('intensities', float), ('x', int)])
xi['intensities'] = intensities
xi['x'] = x
g = itertools.groupby(xi, lambda z: z['x'])
xx = np.array([np.array([z[0] for z in y[1]]).mean() for y in g])
print(xx)
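As a quick sanity check, the groupby result should agree with the bincount result from above (a sketch, reusing the J mask defined earlier):
print(np.allclose(np.bincount(x, weights=intensities)[J] / np.bincount(x)[J], xx))   # True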
Here's a compact numpy solution, using the return_index option of np.unique and np.split. x should be sorted. I'm not optimistic about the speed for large arrays, since there will be iteration in unique and split, in addition to the comprehension.
[values, index] = np.unique(x, return_index=True)
[y.mean() for y in np.split(intensities, index[1:])]
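To keep the distinct x values paired with their means, the same thing can be wrapped up like this (a sketch):
values, index = np.unique(x, return_index=True)
means = np.array([y.mean() for y in np.split(intensities, index[1:])])
# values[k] is the x value whose mean intensity is means[k]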
Upvotes: 0
Reputation: 10759
Here is my implementation of group_by functionality in numpy. It is conceptually similar to the pandas solution, except that it does not require pandas, and it ought to become part of the numpy core, in my opinion.
Using this functionality, your code would look like this:
radialDistribution = group_by(x).mean(intensities)
and it would complete in no time.
Also have a look at the test_radial function defined at the end, which may come even closer to your end goal.
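The implementation itself is not reproduced in this post, but as a rough sketch of the interface (an illustration built on np.argsort and np.add.reduceat, not the actual module referred to above):
import numpy as np

class group_by(object):
    # Illustrative sketch only; not the full implementation this answer refers to.
    def __init__(self, keys):
        self.sorter = np.argsort(keys)
        sorted_keys = keys[self.sorter]
        # True at the start of each run of equal keys
        flag = np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1]))
        self.unique = sorted_keys[flag]           # the distinct key values
        self.start = np.flatnonzero(flag)         # where each group begins
        self.count = np.diff(np.append(self.start, len(sorted_keys)))

    def mean(self, values):
        # per-group sums via reduceat, divided by the group sizes
        sums = np.add.reduceat(values[self.sorter], self.start)
        return sums / self.count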
Upvotes: 2
Reputation:
Here's a method that uses broadcasting:
# arrays need to be at least 2D for broadcasting
x = np.atleast_2d(x)
# create vector of indices
i = np.atleast_2d(np.arange(x.size))
# do the vectorized calculation
bool_eq = (x == i.T)
totals = np.sum(np.where(bool_eq, intensities, 0), axis=1)
rD = totals / np.sum(bool_eq, axis=1)
This uses broadcasting two times: in the operation x == i.T and in the call to np.where. Unfortunately the code above is very slow, even slower than the original. The main bottleneck here is np.where, which we can speed up in this case by taking the product of the Boolean array and the intensities (also by broadcasting):
totals = np.sum(bool_eq*intensities, axis=1)
And this is essentially the same as a matrix-vector product, so we can write:
totals = np.dot(intensities, bool_eq.T)
The end result is faster code than the original (at least until memory use for the intermediate array becomes the limiting factor), but you're probably better off with an iterative approach, as suggested by one of the other answers.
Edit: making use of np.einsum was faster still (in my trial):
totals = np.einsum('ij,j', bool_eq, intensities)
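If you want to check which variant is fastest on your own machine, a small timing sketch along these lines could be used (the array sizes here are placeholders):
import timeit

setup = """
import numpy as np
x = np.repeat(np.arange(1000), 10)        # placeholder data
intensities = np.random.rand(x.size)
i = np.atleast_2d(np.arange(1000))
bool_eq = (np.atleast_2d(x) == i.T)
"""
variants = ["np.sum(np.where(bool_eq, intensities, 0), axis=1)",
            "np.sum(bool_eq * intensities, axis=1)",
            "np.dot(intensities, bool_eq.T)",
            "np.einsum('ij,j', bool_eq, intensities)"]
for stmt in variants:
    print(stmt, timeit.timeit(stmt, setup=setup, number=10))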
Upvotes: 1
Reputation: 67427
If your x vector has consecutive integers starting at 0, then you can simply do:
radialDistribution = np.bincount(x, weights=intensities) / np.bincount(x)
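If some integers between 0 and x.max() never occur in x, the second bincount contains zeros and the division produces warnings and NaNs for those bins. One way to guard against that (a sketch) is:
counts = np.bincount(x)
sums = np.bincount(x, weights=intensities)
radialDistribution = np.full(counts.shape, np.nan)   # NaN for empty bins
nonzero = counts > 0
radialDistribution[nonzero] = sums[nonzero] / counts[nonzero]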
Upvotes: 5