user3329302

Reputation: 627

How to vectorize the dot product between segments of a trajectory

Here is a function for the dot product between consecutive segments of a trajectory (xy coordinates). The result is as expected but the "for loop" makes it very slow.

In [94]:
def func1(xy, s):
    size = xy.shape[0]-2*s
    out = np.zeros(size)
    for i in range(size):
        p1, p2 = xy[i], xy[i+s]     #segment 1
        p3, p4 = xy[i+s], xy[i+2*s] #segment 2
        out[i] = np.dot(p1-p2, p4-p3)
    return out

xy = np.array([[1,2],[2,3],[3,4],[5,6],[7,8],[2,4],[5,2],[9,9],[1,1]])
func1(xy, 2)

Out[94]:
array([-16.,  15.,  32.,  31., -14.])

I looked for a way to vectorize the above, hoping to make it much faster. Here is what I came up with:

In [95]:
def func2(xy, s):
    size = xy.shape[0]-2*s
    p1 = xy[0:size]
    p2 = xy[s:size+s]
    p3 = p2
    p4 = xy[2*s:size+2*s]
    return np.diagonal(np.dot((p1-p2), (p4-p3).T))

func2(xy, 2)

Out[95]:
array([-16,  15,  32,  31, -14])

Unfortunately, the dot product produces a square matrix from which I have to take the diagonal:

In [96]:
print np.dot((p1-p2), (p4-p3).T)
np.diagonal(np.dot((p1-p2), (p4-p3).T))

[[-16  10  16 -24  10]
 [-24  15  24 -36  15]
 [-32  20  32 -48  20]
 [ 20 -13 -18  31 -14]
 [ 32 -18 -40  44 -14]]

Out[96]:
array([-16,  15,  32,  31, -14])

My solution is really awful. It improves speed only by a factor of 2 and, more importantly, it does not scale. My average trajectories have several tens of thousands of points, which means I'll have to deal with huge matrices.
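As a rough, back-of-the-envelope sketch of the memory cost (assuming, say, a 50,000-point trajectory and float64 intermediates):

n = 50000                     # assumed trajectory length, tens of thousands of points
s = 2
size = n - 2*s                # length of the output I actually want
full_bytes = size * size * 8  # the full size-by-size dot-product matrix: roughly 20 GB
diag_bytes = size * 8         # the diagonal alone: roughly 0.4 MB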

Do you guys know of a better way? Thank you

EDIT: Awesome! einsum is definitely the solution. In my frustration I wrote the dot product out myself. I know, it's not very readable and it defeats the purpose of using optimized libraries, but here it is anyway (func4). Speed is on par with einsum.

def func4(xy, s):
    size = xy.shape[0]-2*s
    tmp1 = xy[0:size] - xy[s:size+s]
    tmp2 = xy[2*s:size+2*s] - xy[s:size+s]
    return tmp1[:, 0] * tmp2[:, 0] + tmp1[:, 1] * tmp2[:, 1]

Upvotes: 3

Views: 178

Answers (2)

hpaulj

Reputation: 231385

einsum is a good tool for generalizing the dot product. Playing around with it, I can reproduce your numbers with:

np.einsum('ij,ij->i',p1-p2,p4-p3)

The subscripts 'ij,kj' produce dot(p1-p2, (p4-p3).T); repeating the row index, as in 'ij,ij->i', keeps only the diagonal - all in one step.
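A small illustration of that equivalence (made-up 3x2 arrays standing in for p1-p2 and p4-p3, not taken from the question):

import numpy as np

a = np.arange(6).reshape(3, 2)        # stands in for p1 - p2
b = np.arange(6).reshape(3, 2) + 1    # stands in for p4 - p3

full = np.einsum('ij,kj->ik', a, b)   # same as np.dot(a, b.T): the full 3x3 matrix
rows = np.einsum('ij,ij->i', a, b)    # row-wise dot products, i.e. just the diagonal

assert np.allclose(full, np.dot(a, b.T))
assert np.allclose(rows, np.diagonal(full))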


as a spin off from your cross-product question I tried

np.allclose(np.einsum('ij,ij->i', tmp11, tmp21),
            tmp11[:,0]*tmp21[:,0] + tmp11[:,1]*tmp21[:,1])

For 5000-row arrays the explicit sum of products is almost 2x the speed of the einsum calc.

Upvotes: 3

unutbu

Reputation: 879661

Your idea in func2 leads naturally to using np.einsum.

The good part of func2 is that it only computes p1, p2, p3, p4 once as larger arrays instead of small pieces as in func1.

The bad part of func2 is that it does a lot of dot products that you don't care about.

That's where einsum comes in. It is a more flexible version of np.dot. Whenever you are computing a sum of products, think about using np.einsum. It is likely to be among the fastest (if not the fastest) ways to compute the quantity using NumPy.

def func3(xy, s):
    size = xy.shape[0]-2*s
    p1 = xy[0:size]
    p2 = xy[s:size+s]
    p3 = p2
    p4 = xy[2*s:size+2*s]
    return np.einsum('ij,ij->i', p1-p2, p4-p3)

The subscript string 'ij,ij->i' has the following meaning:

There are two parts to it: ij,ij before the arrow (->), and i after the arrow.

On the left, the ij before the comma refers to the subscripts for p1-p2, and the ij after the comma refers to the subscripts for p4-p3.

The Einstein summation notation sums over repeated subscripts that do not appear after the arrow. In this case, j is repeated and does not appear after the arrow.

So for each i, the sum (p1-p2)[i,j]*(p4-p3)[i,j] is computed, where the sum runs over all j. The result is an array indexed by i.
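In other words, 'ij,ij->i' is the einsum spelling of "multiply elementwise, then sum along axis 1". A tiny sketch of that equivalence (the arrays here are made up for illustration):

import numpy as np

A = np.array([[1., 2.], [3., 4.], [5., 6.]])   # plays the role of p1 - p2
B = np.array([[7., 8.], [9., 0.], [1., 2.]])   # plays the role of p4 - p3

assert np.allclose(np.einsum('ij,ij->i', A, B), (A * B).sum(axis=1))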


Sanity check:

In [90]: np.allclose(func1(xy, 2), func3(xy, 2))
Out[90]: True

Here is a benchmark on an array xy of shape (9000, 2), showing that using einsum is about 450x faster than func1 and about 7470x faster than func2:

In [13]: xy = np.tile(xy, (1000,1))

In [14]: %timeit func1(xy, 2)
10 loops, best of 3: 42.1 ms per loop

In [15]: %timeit func2(xy, 2)
1 loops, best of 3: 686 ms per loop

In [16]: %timeit func3(xy, 2)
10000 loops, best of 3: 91.8 µs per loop

The OP's func4 performs even better than func3!

In [92]: %timeit func4(xy, 2)
10000 loops, best of 3: 74.1 µs per loop

I think the reason func4 beats einsum here is that the cost of setting up the loop in einsum for just 2 iterations (the two coordinate columns) is too expensive compared to simply writing the sum out by hand.
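If you want that manual-sum speed without hard-coding the two coordinate columns, the same computation can be written with an elementwise multiply and a sum along axis 1 (func5 is just an illustrative name; I have not benchmarked it against func4 here):

def func5(xy, s):
    # same idea as func4, but without writing the two columns out by hand
    size = xy.shape[0] - 2*s
    tmp1 = xy[0:size] - xy[s:size+s]
    tmp2 = xy[2*s:size+2*s] - xy[s:size+s]
    return (tmp1 * tmp2).sum(axis=1)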

Upvotes: 5
