Reputation: 627
Here is a function that computes the dot product between consecutive segments of a trajectory (xy coordinates). The result is as expected, but the for loop makes it very slow.
In [94]:
import numpy as np

def func1(xy, s):
    size = xy.shape[0] - 2*s
    out = np.zeros(size)
    for i in range(size):
        p1, p2 = xy[i], xy[i+s]        # segment 1
        p3, p4 = xy[i+s], xy[i+2*s]    # segment 2
        out[i] = np.dot(p1-p2, p4-p3)
    return out
xy = np.array([[1,2],[2,3],[3,4],[5,6],[7,8],[2,4],[5,2],[9,9],[1,1]])
func1(xy, 2)
Out[94]:
array([-16., 15., 32., 31., -14.])
I looked for a way to vectorize the above, hoping to make it much faster. Here is what I came up with:
In [95]:
def func2(xy, s):
    size = xy.shape[0] - 2*s
    p1 = xy[0:size]
    p2 = xy[s:size+s]
    p3 = p2
    p4 = xy[2*s:size+2*s]
    return np.diagonal(np.dot((p1-p2), (p4-p3).T))
func2(xy, 2)
Out[95]:
array([-16, 15, 32, 31, -14])
Unfortunately, the dot product produces a square matrix from which I have to take the diagonal:
In [96]:
print np.dot((p1-p2), (p4-p3).T)
np.diagonal(np.dot((p1-p2), (p4-p3).T))
[[-16 10 16 -24 10]
[-24 15 24 -36 15]
[-32 20 32 -48 20]
[ 20 -13 -18 31 -14]
[ 32 -18 -40 44 -14]]
Out[96]:
array([-16, 15, 32, 31, -14])
My solution is really awful. It improves speed by only a factor of 2 and, more importantly, it is not scalable. My average trajectories have several tens of thousands of points, which means I would have to deal with huge matrices.
Do you guys know of a better way? Thank you
EDIT: Awesome! einsum is definitely the solution. In my frustration I wrote the dot product out myself. I know it is not very readable and it defeats the purpose of using optimized libraries, but here it is anyway (func4). Speed is on par with einsum.
def func4(xy, s):
    size = xy.shape[0] - 2*s
    tmp1 = xy[0:size] - xy[s:size+s]
    tmp2 = xy[2*s:size+2*s] - xy[s:size+s]
    return tmp1[:, 0] * tmp2[:, 0] + tmp1[:, 1] * tmp2[:, 1]
Upvotes: 3
Views: 178
Reputation: 231385
einsum is a good tool for generalizing the dot product. Playing around with it, I can reproduce your numbers with:
np.einsum('ij,ij->i',p1-p2,p4-p3)
'ij,kj' would produce dot(p1-p2, (p4-p3).T); the repeated i in 'ij,ij->i' picks out just the diagonal - all in one step.
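A small demonstration of the two subscript forms (the arrays here are made up for illustration, not taken from the question):

```python
import numpy as np

# Toy stand-ins for p1-p2 and p4-p3
a = np.array([[1., 2.], [3., 4.], [5., 6.]])
b = np.array([[7., 8.], [9., 10.], [11., 12.]])

# 'ij,kj' sums over j only, producing the full matrix dot(a, b.T)
full = np.einsum('ij,kj', a, b)
print(np.allclose(full, np.dot(a, b.T)))                # True

# 'ij,ij->i' pairs rows up, computing only the diagonal, directly
diag = np.einsum('ij,ij->i', a, b)
print(np.allclose(diag, np.diagonal(np.dot(a, b.T))))   # True
```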
As a spin-off from your cross-product question, I also tried writing the per-component sum out by hand:

    tmp11[:, 0]*tmp21[:, 0] + tmp11[:, 1]*tmp21[:, 1]

instead of np.einsum('ij,ij->i', tmp11, tmp21). For 5000-row arrays it is almost 2x the speed of the einsum calc.
Upvotes: 3
Reputation: 879661
Your idea in func2 leads naturally to using np.einsum.

The good part of func2 is that it computes p1, p2, p3, p4 only once, as larger arrays, instead of in small pieces as in func1. The bad part of func2 is that it computes a lot of dot products that you don't care about.

That's where einsum comes in. It is a more flexible version of np.dot.
Whenever you are computing a sum of products, think about using np.einsum. It is likely to be among the fastest (if not the fastest) ways to compute the quantity using NumPy.
def func3(xy, s):
    size = xy.shape[0] - 2*s
    p1 = xy[0:size]
    p2 = xy[s:size+s]
    p3 = p2
    p4 = xy[2*s:size+2*s]
    return np.einsum('ij,ij->i', p1-p2, p4-p3)
The subscript string 'ij,ij->i' has two parts: before the arrow (->), on the left, ij,ij, and after the arrow, i.

On the left, the ij before the comma refers to the subscripts of p1-p2, and the ij after the comma refers to the subscripts of p4-p3.

The Einstein summation notation sums over repeated subscripts that do not appear after the arrow. Here, j is repeated and does not appear after the arrow, so for each i the sum of (p1-p2)[i,j]*(p4-p3)[i,j] over all j is computed. The result is an array indexed by i.
Sanity check:
In [90]: np.allclose(func1(xy, 2), func3(xy, 2))
Out[90]: True
Here is a benchmark on an array xy of shape (9000, 2), showing that einsum is about 450x faster than func1 and about 7470x faster than func2:
In [13]: xy = np.tile(xy, (1000,1))
In [14]: %timeit func1(xy, 2)
10 loops, best of 3: 42.1 ms per loop
In [15]: %timeit func2(xy, 2)
1 loops, best of 3: 686 ms per loop
In [16]: %timeit func3(xy, 2)
10000 loops, best of 3: 91.8 µs per loop
The OP's func4 performs even better than func3!
In [92]: %timeit func4(xy, 2)
10000 loops, best of 3: 74.1 µs per loop
I think the reason func4 beats einsum here is that the cost of setting up the loop in einsum for just 2 iterations is too expensive compared with simply writing out the sum manually.
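The gap can be explored with a rough timing sketch like the one below (timings vary by machine and NumPy version; the printed check only confirms the two computations agree):

```python
import timeit
import numpy as np

# Random trajectory of the same shape as the benchmark above
xy = np.random.rand(9000, 2)
s = 2
size = xy.shape[0] - 2*s
tmp1 = xy[0:size] - xy[s:size+s]
tmp2 = xy[2*s:size+2*s] - xy[s:size+s]

einsum_way = lambda: np.einsum('ij,ij->i', tmp1, tmp2)
manual_way = lambda: tmp1[:, 0]*tmp2[:, 0] + tmp1[:, 1]*tmp2[:, 1]

# Both compute the same per-row dot products
print(np.allclose(einsum_way(), manual_way()))  # True

# Machine-dependent relative timings
print(timeit.timeit(einsum_way, number=1000))
print(timeit.timeit(manual_way, number=1000))
```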
Upvotes: 5