Illia Iorin

Reputation: 27

How to speed up numpy tensor*tensor operation

I have a bottleneck in my code: elementwise multiplication of a 3D numpy array by another 3D numpy array with the * operator.
I tried to speed this part of the program up with numba's @njit or @jit decorator, but it roughly halved performance instead.
Slow part of the code:

import numba

@numba.jit
def mat_mul_and_sum(img1, img2, alpha):
    return img1*(1-alpha) + img2*alpha

img1, img2, and alpha are 3D np.arrays with the same shape.
Is it possible to speed up this line of code?

Upvotes: 2

Views: 327

Answers (2)

user3483203

Reputation: 51155

One option is to use numba the way it's intended: writing out explicit loops rather than just applying a decorator to vectorized numpy code. However, for this particular expression you can get multicore evaluation with the numexpr package.


import numexpr as ne

def mat_mul_and_sum_numexpr(a, b, alpha):
    return ne.evaluate('a*(1-alpha) + b*alpha')

Using the timings from the other answer:

In [11]: %timeit mat_mul_and_sum(img1, img2, alpha)
21.6 ms ± 955 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit mat_mul_and_sum2(img1, img2, alpha)
6.35 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit mat_mul_and_sum_numexpr(img1, img2, alpha)
4.22 ms ± 54.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [14]: np.allclose(mat_mul_and_sum(img1, img2, alpha), mat_mul_and_sum_numexpr(img1, img2, alpha))
Out[14]: True

You might be able to squeeze out some extra performance with numba's parallelization, but oftentimes numexpr provides a nice performance boost without requiring any code rewrite.
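As a side note, numexpr uses all available cores by default; if you want to control the worker count, it exposes set_num_threads. A minimal sketch (array sizes here are illustrative, not from the question):

```python
import numpy as np
import numexpr as ne

a = np.random.normal(size=(50, 50, 50))
b = np.random.normal(size=(50, 50, 50))
alpha = np.random.normal(size=(50, 50, 50))

ne.set_num_threads(4)  # cap the thread pool; default is all cores
out = ne.evaluate('a*(1-alpha) + b*alpha')
```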

Upvotes: 4

JoshAdel

Reputation: 68682

If you unroll the loops as follows, for an array size of (100, 100, 100), numba is twice as fast as the pure numpy version, likely due to the fact that no intermediate arrays need to be allocated:

import numpy as np
import numba as nb

def mat_mul_and_sum(img1, img2, alpha):
    return img1*(1-alpha) + img2*alpha


@nb.jit
def mat_mul_and_sum2(img1, img2, alpha):
    NI, NJ, NK = img1.shape
    out = np.empty((NI, NJ, NK))

    for i in range(NI):
        for j in range(NJ):
            for k in range(NK):
                out[i,j,k] = img1[i,j,k] * (1.0 - alpha[i,j,k]) + img2[i,j,k] * alpha[i,j,k]

    return out

and then testing:

N = 100
img1 = np.random.normal(size=(N, N, N))
img2 = np.random.normal(size=(N, N, N))
alpha = np.random.normal(size=(N, N, N))

A = mat_mul_and_sum(img1, img2, alpha)
B = mat_mul_and_sum2(img1, img2, alpha)

np.allclose(A,B) #True

%timeit mat_mul_and_sum(img1, img2, alpha)
# 4.6 ms ± 44.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit mat_mul_and_sum2(img1, img2, alpha)
# 2.4 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Update: You can also try changing the decorator to nb.jit(parallel=True) and replacing the outer loop with for i in nb.prange(NI):, which on my machine drops the timeit result to 1.37 ms. These and the other timings will certainly differ from machine to machine and with the size of the inputs.

Upvotes: 3
