Reputation: 27
I have a bottleneck in my code: elementwise multiplication of two NumPy 3D arrays with the * operator.
I wanted to speed up this part of the program with numba's @njit or @jit decorator, but it made this code about twice as slow.
Slow part of the code:
import numba

@numba.jit
def mat_mul_and_sum(img1, img2, alpha):
    return img1*(1-alpha) + img2*alpha
img1, img2, and alpha are 3D np.arrays with the same shape.
Is it possible to speed up this line of code?
Upvotes: 2
Views: 327
Reputation: 51155
One option is actually using numba the way it's supposed to be used (not just applying a decorator to vectorized NumPy code). However, for your particular function you can get multi-core evaluation with the numexpr package.
import numexpr as ne

def mat_mul_and_sum_numexpr(a, b, alpha):
    # numexpr compiles the expression once and evaluates it
    # in parallel chunks, without allocating intermediate arrays
    return ne.evaluate('a*(1-alpha) + b*alpha')
Using the timings from the other answer:
In [11]: %timeit mat_mul_and_sum(img1, img2, alpha)
21.6 ms ± 955 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [12]: %timeit mat_mul_and_sum2(img1, img2, alpha)
6.35 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [13]: %timeit mat_mul_and_sum_numexpr(img1, img2, alpha)
4.22 ms ± 54.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [14]: np.allclose(mat_mul_and_sum(img1, img2, alpha), mat_mul_and_sum_numexpr(img1, img2, alpha))
Out[14]: True
You might be able to squeeze out some extra performance with numba's parallelization, but oftentimes numexpr provides a nice performance boost without requiring any code rewrite.
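If you want to check how much of that gain comes from threading, numexpr lets you control the worker count via set_num_threads; a minimal sketch (the exact speedup will depend on your CPU):
import numexpr as ne

ne.set_num_threads(1)  # force single-threaded evaluation
%timeit mat_mul_and_sum_numexpr(img1, img2, alpha)

ne.set_num_threads(ne.detect_number_of_cores())  # back to all cores
%timeit mat_mul_and_sum_numexpr(img1, img2, alpha)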
Upvotes: 4
Reputation: 68682
If you write out the loops explicitly as follows, then for an array size of (100, 100, 100), numba is twice as fast as the pure NumPy version, likely because no intermediate arrays need to be allocated:
import numpy as np
import numba as nb

def mat_mul_and_sum(img1, img2, alpha):
    return img1*(1-alpha) + img2*alpha

@nb.jit
def mat_mul_and_sum2(img1, img2, alpha):
    NI, NJ, NK = img1.shape
    out = np.empty((NI, NJ, NK))
    for i in range(NI):
        for j in range(NJ):
            for k in range(NK):
                out[i,j,k] = img1[i,j,k] * (1.0 - alpha[i,j,k]) + img2[i,j,k] * alpha[i,j,k]
    return out
and then testing:
N = 100
img1 = np.random.normal(size=(N, N, N))
img2 = np.random.normal(size=(N, N, N))
alpha = np.random.normal(size=(N, N, N))
A = mat_mul_and_sum(img1, img2, alpha)
B = mat_mul_and_sum2(img1, img2, alpha)
np.allclose(A,B) #True
%timeit mat_mul_and_sum(img1, img2, alpha)
# 4.6 ms ± 44.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit mat_mul_and_sum2(img1, img2, alpha)
# 2.4 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Update:
You can also try changing the decorator to nb.jit(parallel=True) and replacing the outer loop with for i in nb.prange(NI):, which on my machine drops the timeit result to 1.37 ms. This and the other timings will certainly differ from machine to machine, and will also depend on the size of the inputs.
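For reference, a minimal sketch of that parallel variant (the name mat_mul_and_sum2_par is just for illustration; only the decorator and the outer loop change):
@nb.jit(parallel=True)
def mat_mul_and_sum2_par(img1, img2, alpha):
    NI, NJ, NK = img1.shape
    out = np.empty((NI, NJ, NK))
    for i in nb.prange(NI):  # numba distributes the outer loop across threads
        for j in range(NJ):
            for k in range(NK):
                out[i,j,k] = img1[i,j,k] * (1.0 - alpha[i,j,k]) + img2[i,j,k] * alpha[i,j,k]
    return out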
Upvotes: 3