Cosinux

Reputation: 321

Why is Numba optimizing this regular Python loop, but not the numpy operations?

I wrote this simple test to gauge the performance of Numba and compare it to regular Python and Numpy:

import numba.cuda
import numba
import numpy
import time
import math



SIZE = 1000000
ITER = 10
BLOCK = 256



def func_py(result, op1, op2):
    for pos in range(SIZE):
        result[pos] += op1[pos] * op2[pos]


def func_numpy(result, op1, op2):
    result += op1 * op2


@numba.jit(nopython=True)
def func_numba_py(result, op1, op2):
    for pos in range(SIZE):
        result[pos] += op1[pos] * op2[pos]


@numba.jit(nopython=True)
def func_numba_numpy(result, op1, op2):
    result += op1 * op2


@numba.cuda.jit
def func_cuda(result, op1, op2):
    pos = numba.cuda.grid(1)
    if pos < SIZE:
        result[pos] += op1[pos] * op2[pos]



bnum = int(math.ceil(SIZE / BLOCK))



print("Python")
for i in range(ITER):
    result = numpy.random.rand(SIZE)
    op1 = numpy.random.rand(SIZE)
    op2 = numpy.random.rand(SIZE)
    
    t1 = time.perf_counter()
    func_py(result, op1, op2)
    t2 = time.perf_counter()
    
    elapsed = t2 - t1
    print("Call %i | %.2f ms (%.1f Hz)" % (i + 1, elapsed * 1000, 1 / elapsed))
print()


print("Numpy")
for i in range(ITER):
    result = numpy.random.rand(SIZE)
    op1 = numpy.random.rand(SIZE)
    op2 = numpy.random.rand(SIZE)
    
    t1 = time.perf_counter()
    func_numpy(result, op1, op2)
    t2 = time.perf_counter()
    
    elapsed = t2 - t1
    print("Call %i | %.2f ms (%.1f Hz)" % (i + 1, elapsed * 1000, 1 / elapsed))
print()


print("Numba python")
for i in range(ITER):
    result = numpy.random.rand(SIZE)
    op1 = numpy.random.rand(SIZE)
    op2 = numpy.random.rand(SIZE)
    
    t1 = time.perf_counter()
    func_numba_py(result, op1, op2)
    t2 = time.perf_counter()
    
    elapsed = t2 - t1
    print("Call %i | %.2f ms (%.1f Hz)" % (i + 1, elapsed * 1000, 1 / elapsed))
print()


print("Numba_numpy")
for i in range(ITER):
    result = numpy.random.rand(SIZE)
    op1 = numpy.random.rand(SIZE)
    op2 = numpy.random.rand(SIZE)
    
    t1 = time.perf_counter()
    func_numba_numpy(result, op1, op2)
    t2 = time.perf_counter()
    
    elapsed = t2 - t1
    print("Call %i | %.2f ms (%.1f Hz)" % (i + 1, elapsed * 1000, 1 / elapsed))
print()


print("CUDA")
for i in range(ITER):
    result = numpy.random.rand(SIZE)
    op1 = numpy.random.rand(SIZE)
    op2 = numpy.random.rand(SIZE)
    
    t1 = time.perf_counter()
    func_cuda[bnum, BLOCK](result, op1, op2)
    t2 = time.perf_counter()
    
    elapsed = t2 - t1
    print("Call %i | %.2f ms (%.1f Hz)" % (i + 1, elapsed * 1000, 1 / elapsed))

Here are the results:

Python
Call 1 | 353.78 ms (2.8 Hz)
Call 2 | 353.26 ms (2.8 Hz)
Call 3 | 356.26 ms (2.8 Hz)
Call 4 | 354.09 ms (2.8 Hz)
Call 5 | 356.45 ms (2.8 Hz)
Call 6 | 375.48 ms (2.7 Hz)
Call 7 | 355.36 ms (2.8 Hz)
Call 8 | 355.85 ms (2.8 Hz)
Call 9 | 356.12 ms (2.8 Hz)
Call 10 | 354.66 ms (2.8 Hz)

Numpy
Call 1 | 4.09 ms (244.7 Hz)
Call 2 | 4.36 ms (229.2 Hz)
Call 3 | 4.11 ms (243.1 Hz)
Call 4 | 3.99 ms (250.6 Hz)
Call 5 | 4.06 ms (246.0 Hz)
Call 6 | 4.55 ms (219.8 Hz)
Call 7 | 4.05 ms (246.9 Hz)
Call 8 | 4.31 ms (232.2 Hz)
Call 9 | 4.14 ms (241.4 Hz)
Call 10 | 4.40 ms (227.2 Hz)

Numba python
Call 1 | 107.88 ms (9.3 Hz)
Call 2 | 1.53 ms (654.1 Hz)
Call 3 | 1.47 ms (681.5 Hz)
Call 4 | 1.42 ms (706.2 Hz)
Call 5 | 1.45 ms (692.0 Hz)
Call 6 | 1.51 ms (664.3 Hz)
Call 7 | 1.48 ms (674.2 Hz)
Call 8 | 1.47 ms (682.5 Hz)
Call 9 | 1.40 ms (716.6 Hz)
Call 10 | 1.44 ms (696.4 Hz)

Numba_numpy
Call 1 | 235.23 ms (4.3 Hz)
Call 2 | 3.88 ms (257.7 Hz)
Call 3 | 4.17 ms (239.6 Hz)
Call 4 | 3.93 ms (254.2 Hz)
Call 5 | 3.90 ms (256.3 Hz)
Call 6 | 3.95 ms (253.1 Hz)
Call 7 | 4.16 ms (240.4 Hz)
Call 8 | 4.08 ms (245.1 Hz)
Call 9 | 3.97 ms (252.0 Hz)
Call 10 | 4.09 ms (244.6 Hz)

CUDA
Call 1 | 258.92 ms (3.9 Hz)
Call 2 | 11.67 ms (85.7 Hz)
Call 3 | 11.21 ms (89.2 Hz)
Call 4 | 12.61 ms (79.3 Hz)
Call 5 | 10.93 ms (91.5 Hz)
Call 6 | 11.21 ms (89.2 Hz)
Call 7 | 10.85 ms (92.2 Hz)
Call 8 | 12.30 ms (81.3 Hz)
Call 9 | 10.85 ms (92.2 Hz)
Call 10 | 10.86 ms (92.1 Hz)

I'm surprised to see that the fastest function here is the one that uses a Numba-optimized loop. I was under the impression that Numba was capable of optimizing Numpy code too, and I expected to at least see similar performance between func_numba_py and func_numba_numpy.

Why did Numba fail to optimize the simple Numpy function here?

Upvotes: 2

Views: 119

Answers (1)

Jérôme Richard

Reputation: 50921

The Numpy code is not as fast as the Numba loop because of temporary arrays. Indeed, op1 * op2 allocates and writes into a temporary array, which is then read back by result += ... before the final values are written into the output array result. Such accesses are not a problem when the arrays fit in the L1 CPU cache. However, the arrays are big here and may not fit in any CPU cache, resulting in several slow RAM reads/writes.
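
As a rough sketch of what happens under the hood (this is not Numpy's actual implementation, only the implied memory traffic):

def numpy_like(result, op1, op2):
    tmp = op1 * op2    # allocate a temporary, read op1/op2, write tmp to memory
    result += tmp      # read tmp and result back from memory, write result

def fused(result, op1, op2):
    # the Numba loop touches each element of each array exactly once
    for pos in range(result.shape[0]):
        result[pos] += op1[pos] * op2[pos]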

Numba can optimize some Numpy functions, but AFAIK it is not able to merge the computations so they are done in a single pass. Actually, it is not always possible to remove all temporary arrays, and doing it correctly in the general case is pretty complex. For example, result[1:-1] = result[2:] + result[:-2] cannot be done in-place due to aliasing. Moreover, doing this merge does not always make the code faster due to complex hardware effects (e.g. broken vectorization and cache thrashing).
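
A small sketch of that aliasing issue: Numpy fully evaluates the right-hand side into a temporary before storing anything, whereas a naively fused in-place loop reads values it has already overwritten and gives a different answer.

import numpy

a = numpy.arange(6, dtype=numpy.float64)

# Numpy semantics: the RHS is computed into a temporary before any store,
# so every output uses the old values of `a`.
expected = a.copy()
expected[1:-1] = a[2:] + a[:-2]

# Naive in-place fusion: a[i - 1] may already have been updated.
naive = a.copy()
for i in range(1, len(naive) - 1):
    naive[i] = naive[i + 1] + naive[i - 1]

print(expected)  # [0. 2. 4. 6. 8. 5.]
print(naive)     # [ 0.  2.  5.  9. 14.  5.] -- different result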

Note that the @vectorize decorator should help to do this kind of optimization, since in that case Numba knows the code does not contain any tricky corner case (e.g. aliasing issues): the computation is implicitly done element-wise.
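
For instance, here is a minimal sketch of the same multiply-add written as a Numba ufunc (the fma name is just an illustration, not something from the question):

import numba
import numpy

# Element-wise kernel: Numba generates a Numpy ufunc from it, so no
# temporary array is materialized for op1 * op2.
@numba.vectorize(["float64(float64, float64, float64)"])
def fma(res, a, b):
    return res + a * b

result = numpy.random.rand(1000)
op1 = numpy.random.rand(1000)
op2 = numpy.random.rand(1000)

# The generated ufunc supports the usual `out` argument to write in-place.
fma(result, op1, op2, out=result)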

Finally, the GPU code deals with double-precision floating-point numbers, while most mainstream Nvidia GPUs targeting personal computers are very slow at computing them. Consider using single-precision floating-point numbers (or mixed precision) if you want fast code on GPUs.
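
A minimal sketch of the single-precision variant, reusing the kernel from the question (note that numpy.random.rand always returns float64, hence the explicit casts):

import math
import numba.cuda
import numpy

SIZE = 1000000
BLOCK = 256

@numba.cuda.jit
def func_cuda(result, op1, op2):
    pos = numba.cuda.grid(1)
    if pos < result.shape[0]:
        result[pos] += op1[pos] * op2[pos]

# Cast the float64 arrays to float32 before launching the kernel.
result = numpy.random.rand(SIZE).astype(numpy.float32)
op1 = numpy.random.rand(SIZE).astype(numpy.float32)
op2 = numpy.random.rand(SIZE).astype(numpy.float32)

bnum = int(math.ceil(SIZE / BLOCK))
func_cuda[bnum, BLOCK](result, op1, op2)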

Upvotes: 1
