matanox

Reputation: 13686

Leveraging and optimizing SIMD for matrix axis looping in cython

The following Cython 3.0.8 code is compiled with the -O3 flag and run on a machine with AVX2 support, but it does not seem to make any use of SIMD instructions when called from the Python main below.

import numpy as np
from libc.math cimport fabs

cpdef inline double[:,:] go(double[:] stream, double[:] query):

    matrix = np.empty((len(stream), len(query)))
    cdef double [:, :] matrix_c = matrix

    cdef int i, j
    cdef int stream_len = len(stream)
    for i in range(stream_len):
        for j in range(len(query)):
            matrix_c[i, j] = fabs(stream[i] - query[j])

    return matrix

Running the Python main under the perf command shows fewer than 10 calls to SIMD instructions, with i and j ranging over 100,000 and 20 respectively:

import timeit
import numpy as np
from go import go  # import the function, not just the module

if __name__ == '__main__':
    np.random.seed(42)
    stream = np.random.rand(100000)
    query = np.linspace(0, 1, 20)
    print(f'mean runtime is {np.mean(timeit.repeat(lambda: go(stream, query), number=10, repeat=100))}')
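For comparison, the same matrix can be produced with pure NumPy broadcasting; NumPy's ufunc inner loops are themselves SIMD-vectorized, so this makes a handy single-core baseline (the name `go_numpy` is mine, not part of the code above):

```python
import numpy as np

def go_numpy(stream, query):
    # Broadcasting stream with shape (n, 1) against query with shape (m,)
    # yields the full (n, m) matrix of absolute differences in one
    # vectorized pass through NumPy's C ufunc loops.
    return np.abs(stream[:, None] - query[None, :])

stream = np.random.rand(1000)
query = np.linspace(0, 1, 20)
matrix = go_numpy(stream, query)
print(matrix.shape)  # (1000, 20)
```

Timing this alongside the Cython version should indicate how much headroom, if any, is left in the handwritten loop.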

Would you have any suggestions for how to modify it in order to allow the underlying compiler (x86_64-linux-gnu-gcc) to leverage SIMD vector operations?

For checking if SIMD is being used I am using the following command, though I'd gladly use a different method if you recommend one:

 perf stat -e instructions:u,fp_arith_inst_retired.128b_packed_single:u,fp_arith_inst_retired.256b_packed_single:u,fp_arith_inst_retired.128b_packed_double:u,fp_arith_inst_retired.256b_packed_double:u -- python main.py

Indeed, perf shows ~1.5 million uses of SIMD (see the update below).


This is Cython 3.0.x, so some of the syntax may look new to some people. I'm compiling and running on Ubuntu; it's my first time with Cython, as well as my first time trying to explicitly coerce the use of SIMD. Trying out -march=native makes no difference.


I'm attaching the gcc invocations made during the cython build:

x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/.venv3.10/include -I/usr/include/python3.10 -c ....c -o build/temp.linux-x86_64-cpython-310/....o -O3 -march=native

x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 build/temp.linux-x86_64-cpython-310/....o -L/usr/lib/x86_64-linux-gnu -o build/lib.linux-x86_64-cpython-310/....cpython-310-x86_64-linux-gnu.so -O3 -march=native
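For reference, flags like -O3 and -march=native are typically appended to those invocations via extra_compile_args in setup.py; a minimal sketch of such a build script (the module and file names here are placeholders, not taken from my actual build):

```python
from setuptools import setup, Extension
from Cython.Build import cythonize

extensions = [
    Extension(
        "go",                    # placeholder module name
        ["go.pyx"],              # placeholder source file
        extra_compile_args=["-O3", "-march=native"],
        extra_link_args=["-O3", "-march=native"],
    )
]

# cythonize() generates the .c file; setuptools then invokes gcc with
# the extra flags appended at the end of the command line, which is why
# they appear last in the invocations above.
setup(ext_modules=cythonize(extensions, language_level=3))
```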


Using the -fopt-info flag does yield 24 instances of loops being vectorized in the (now much, much longer) build output of thousands of lines, but they refer to lines of the generated C that are hard to associate with my own code's logic, such as the places where Cython handles its memory views. Output lines such as:

main.c:14454:3: optimized: loop vectorized using 32 byte vectors

I guess that had I written this in C myself, I could have associated the optimization messages with my code much more directly.

Update

Double vectorization is already taking place: [perf output screenshot]

So it only remains to see if the code can be optimized further, which is slightly easier to do in pure C rather than through a generated C source of dozens of thousands of lines of code from cython :-)

Upvotes: 3

Views: 126

Answers (1)

matanox

Reputation: 13686

Well, SIMD is being used after all, I was just negligent in choosing which instructions to let the perf command monitor. Since my code uses doubles, the following is the appropriate perf command to demonstrate the amount of SIMD going on over doubles:

 perf stat -e instructions:u,fp_arith_inst_retired.128b_packed_double:u,fp_arith_inst_retired.256b_packed_double:u -- python main.py

Which yields something like:

[perf stat output screenshot]

The previously included *_single perf events are superfluous to monitor in this specific case, but may come in handy in other cases (when non-double float types are involved).

So over the 100,000 outer-loop iterations of this Cython code, ~1.5 million SIMD instructions operating on doubles in 256-bit registers are executed.

The inner loop performs 20 double subtractions, which, once the numbers are packed into SIMD registers, take just 5 packed operations (4 doubles per 256-bit register) per inner-loop execution. This makes sense.
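That packing arithmetic can be sanity-checked in a few lines (a rough model that ignores any scalar remainder handling the compiler may emit):

```python
import math

doubles_per_register = 256 // 64   # AVX2: four 64-bit doubles per 256-bit register
inner_loop_length = 20             # subtractions per inner-loop execution
packed_ops_per_row = math.ceil(inner_loop_length / doubles_per_register)
print(packed_ops_per_row)  # 5 packed subtractions per outer iteration
```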

I can't rule out that this can be further optimized (on a single core) through better Cython, or in ways that make Cython jump through fewer hoops around the NumPy array inputs. Writing this directly in C, while building the returned NumPy array through NumPy's C API, should potentially be easier to optimize, hopefully without sacrificing safety.

Thank you @PeterCordes for all your help on this.

Upvotes: 1
