satoru

Reputation: 33235

Why is matrix subtraction much slower than dot product in numpy?

Let's say we have a vector and a matrix:

X = np.random.random((1, 384)).astype('float32')
Y = np.random.random((500000, 384)).astype('float32')

Why is np.dot(X, Y.T) much faster than X - Y?

In [8]: %timeit np.dot(X, Y.T)
10 loops, best of 3: 42.4 ms per loop

In [9]: %timeit X - Y
1 loop, best of 3: 501 ms per loop

What can I do to make a subtraction like this as fast as the dot product?

Upvotes: 4

Views: 1683

Answers (3)

user6655984

Reputation:

The size of the output matters, because the output has to be written to memory, and writing a large array takes time. The shape of dot(X, Y.T) is (1, 500000). The shape of X-Y is (500000, 384).

In my test, most of the time taken by X-Y was allocating an array for the output. Compare:

%timeit X - Y   
1 loop, best of 3: 449 ms per loop

with pre-allocated space Z = np.zeros_like(Y),

%timeit np.subtract(X, Y, out=Z)
10 loops, best of 3: 181 ms per loop

So, if you have to do this kind of subtraction repeatedly, having a pre-allocated array of suitable shape and type will save more than half of the execution time.
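As a rough sketch of what reusing such a buffer could look like (the helper name diff_to_all is just for illustration):

import numpy as np

Y = np.random.random((500000, 384)).astype('float32')
Z = np.empty_like(Y)  # allocated once, reused on every call

def diff_to_all(x, Y, out):
    # Write x - Y into the existing buffer instead of allocating
    # a fresh (500000, 384) array each time.
    np.subtract(x, Y, out=out)
    return out

x = np.random.random((1, 384)).astype('float32')
D = diff_to_all(x, Y, Z)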

I don't think that subtraction in your case can be made as fast as multiplication. The amount of arithmetic to do is about the same: each entry of X meets 500000 entries of Y either way. The fact that the results are combined when you do multiplication (the summation step) only helps, as the CPU does it quickly with the numbers it already has at hand, and as a result it has only one number to send back. So: about the same amount of work, but the amount of memory writing is 384 times greater for subtraction.
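To put a number on that: with float32 (4 bytes per element) and the shapes from the question, the dot product's output is about 2 MB while the subtraction's output is about 768 MB, i.e. 384 times more data to write:

dot_out_bytes = 1 * 500000 * 4    # (1, 500000) float32 -> 2,000,000 bytes
sub_out_bytes = 500000 * 384 * 4  # (500000, 384) float32 -> 768,000,000 bytes
print(sub_out_bytes // dot_out_bytes)  # 384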

Here is a demonstration that subtraction is faster when the output size is the same for both (square matrices):

X = np.random.random((1000, 1000)).astype('float32')
Y = np.random.random((1000, 1000)).astype('float32')

%timeit np.dot(X, Y.T)
100 loops, best of 3: 28.7 ms per loop

%timeit X - Y
1000 loops, best of 3: 579 µs per loop

Upvotes: 4

dkato

Reputation: 895

This is more of a comment, and you may already have checked this yourself.

I tested whether broadcasting was the cause, but it had no effect on performance in this case.

In [1]: import numpy as np

In [2]: X = np.random.random((1, 384)).astype('float32')
   ...: Y = np.random.random((500000, 384)).astype('float32')

In [3]: %timeit np.dot(X, Y.T)
   ...: %timeit X - Y
27.4 ms ± 910 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
324 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: import numpy.matlib
   ...: X = np.matlib.repmat(X, 500000, 1)
   ...: print(X.shape)
   ...: %timeit X - Y
(500000, 384)
351 ms ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Unfortunately, I'm not sure how to improve this performance.

Upvotes: 0

Aisha Javed

Reputation: 169

Using np.subtract(X, Y) almost halves the execution time. It is faster than X - Y, but it is still slower than the dot product. This might help.
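For reference, a minimal timing script to reproduce the comparison locally (results will vary with hardware, NumPy version, and the BLAS library backing np.dot):

import numpy as np
import timeit

X = np.random.random((1, 384)).astype('float32')
Y = np.random.random((500000, 384)).astype('float32')
Z = np.empty_like(Y)  # pre-allocated output for the out= variant

for label, fn in [
    ("np.dot(X, Y.T)", lambda: np.dot(X, Y.T)),
    ("X - Y", lambda: X - Y),
    ("np.subtract(X, Y)", lambda: np.subtract(X, Y)),
    ("np.subtract(X, Y, out=Z)", lambda: np.subtract(X, Y, out=Z)),
]:
    t = timeit.timeit(fn, number=10) / 10
    print(f"{label:26s} {t * 1000:7.1f} ms")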

Upvotes: -1
