Make42

Reputation: 13088

How to use Numba optimally across multiple functions?

Let's say I have two functions

def my_sub1(a):
    return a + 2

def my_main(a):
    a += 1
    b = my_sub1(a)
    return b

and I want to make them faster using a just-in-time compiler like Numba. Is this going to be slower than if I refactor everything into one function

def my_main(a):
    a += 1
    b = a + 2
    return b

because Numba can do deeper optimizations in the second case? Of course, my real functions are quite a bit more complex.

Also, this whole situation gets more difficult if a function like my_sub1 gets called more than once: refactoring everything into one function (and maintaining the result) would become a drag. How does Numba solve this issue?

Upvotes: 2

Views: 899

Answers (1)

Jérôme Richard

Reputation: 50318

TL;DR: Numba is able to inline other Numba functions and performs relatively advanced inter-procedural optimizations, but only when native types are used (in that case, both functions are equally fast); this does not happen with NumPy arrays.


You can analyze the assembly code produced by Numba to check how the two functions are optimized. Here is an example with an integer:

import numba as nb

@nb.njit('int64(int64)')
def my_sub1(a):
    return a + 2

@nb.njit('int64(int64)')
def my_main(a):
    a += 1
    b = my_sub1(a)
    return b

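# inspect_asm() returns one assembly listing per compiled signature;
# dump the only entry of each dict to a file for comparison.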
open('my_sub1.asm', 'w').write(list(my_sub1.inspect_asm().values())[0])
open('my_main.asm', 'w').write(list(my_main.inspect_asm().values())[0])

This produces two assembly files. If you compare them, you will see that the only actual difference (besides the different names) is that the first does addq $2, %rdx while the second does addq $3, %rdx. This means that Numba succeeded in inlining the call to my_sub1 into my_main and merging the two additions. Here is the important part of the assembly code:

_ZN8__main__12my_sub1$2413Ex:
    addq    $2, %rdx
    movq    %rdx, (%rdi)
    xorl    %eax, %eax
    retq

_ZN8__main__12my_main$2414Ex:
    addq    $3, %rdx
    movq    %rdx, (%rdi)
    xorl    %eax, %eax
    retq

With 64-bit floats, the result is the same as long as you use fastmath=True: since floating-point addition is not associative, the two additions cannot be merged without this flag.
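
For reference, here is a minimal sketch of what the float version could look like (the float64 signatures and the my_sub1_f/my_main_f names are purely illustrative); dumping the assembly the same way as above lets you check whether the two additions were merged:

import numba as nb

# fastmath=True lets LLVM reassociate the floating-point additions,
# so (a + 1) + 2 can be folded into a single a + 3.
@nb.njit('float64(float64)', fastmath=True)
def my_sub1_f(a):
    return a + 2

@nb.njit('float64(float64)', fastmath=True)
def my_main_f(a):
    a += 1
    return my_sub1_f(a)

open('my_main_f.asm', 'w').write(list(my_main_f.inspect_asm().values())[0])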

Regarding NumPy arrays, the generated code gets huge and it is very difficult to compare the two versions. However, the my_sub1 function no longer seems to be inlined, and Numba does not seem able to merge the NumPy computations (two distinct vectorized loops, one per array addition, are present in the generated code). Note that this is similar to what many C/C++ compilers do. As a result, it is probably better to inline functions yourself in the performance-critical parts of your code.
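
If refactoring by hand is too intrusive, Numba also offers an inline='always' option that inlines the callee at the Numba IR level, before LLVM sees the code. Here is a minimal sketch (the np.arange call is purely illustrative); note that this removes the call boundary but does not guarantee that the two array loops are fused afterwards, so checking the generated assembly as above is still worthwhile:

import numpy as np
import numba as nb

# inline='always' inlines my_sub1 at the Numba IR level, before lowering
# to LLVM; whether the two array loops are then fused is still up to LLVM.
@nb.njit(inline='always')
def my_sub1(a):
    return a + 2

@nb.njit
def my_main(a):
    a = a + 1          # creates a new array instead of mutating the input
    return my_sub1(a)

print(my_main(np.arange(10)))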

Upvotes: 2
