How to use AVX instructions to optimize ReLU written in C

Question

I have the following simplified ReLU simulation code that I am trying to optimize. The code uses a ternary operation, which is perhaps getting in the way of automatic vectorization by the compiler. How can I vectorize this code?

#include <cstring>
#include <mkl.h>

void relu(double* &Amem, double* Z_curr, int bo)
{
    for (int i=0; i<bo; ++i) {
        Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;
    }
}

int main()
{
    int i, j;
    int batch_size = 16384;
    int output_dim = 21;
    // double* Amem = new double[batch_size*output_dim];
    // double* Z_curr = new double[batch_size*output_dim];
    double* Amem = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    double* Z_curr = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    memset(Amem, 0, sizeof(double)*batch_size*output_dim);
    for (i=0; i


To compile it, if you have MKL then use the following, otherwise plain g++ -O3.
g++ -O3 ex.cxx -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5

So far, I have tried adding -march=skylake-avx512 as a compiler option, but it does not vectorize the loop as I found using option -fopt-info-vec-all for compilation:
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
ex.cxx:6:6: note: vectorized 0 loops in function.
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.

and this is the time it takes currently at my end:
time ./a.out
real    0m0.034s
user    0m0.026s
sys     0m0.009s

chtz · Accepted Answer

There is usually no benefit in passing a pointer by reference (unless you want to modify the pointer itself). Furthermore, you can help your compiler with the (non-standard) __restrict keyword, which tells it that the input and output do not alias. Of course, this will likely give wrong results if, e.g., Amem == Z_curr+1, but Amem == Z_curr should be fine in this case.

void relu(double* __restrict Amem, double* Z_curr, int bo)

With that change alone, clang is able to vectorize your loop using vcmpltpd and masked moves (for some reason, only with 256-bit registers).

If you simplify your expression to std::max(Z_curr[i], 0.1*Z_curr[i]), even gcc can easily vectorize it: https://godbolt.org/z/eTv4PnMWb
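A minimal sketch of that rewrite, assuming the factor 0.1 from the question. The function name `relu_max` and the `const` qualifier on the input are illustrative additions, not part of the original code. Since `0.1*z` is smaller than `z` for positive `z` and larger for negative `z`, taking the maximum gives the same result as the ternary without a branch in the loop body.

```cpp
#include <algorithm>
#include <cassert>

// Branch-free form of the loop: out = z if z > 0, otherwise 0.1 * z.
// relu_max is a hypothetical name. __restrict is a non-standard extension
// (accepted by GCC, Clang and MSVC) promising that Amem and Z_curr do not
// partially overlap.
void relu_max(double* __restrict Amem, const double* Z_curr, int bo)
{
    assert(bo >= 0);
    for (int i = 0; i < bo; ++i) {
        // max(z, 0.1*z) picks z for positive inputs and 0.1*z for negative ones
        Amem[i] = std::max(Z_curr[i], 0.1 * Z_curr[i]);
    }
}
```

Whether a given compiler vectorizes this can be checked with the same -fopt-info-vec-all option used in the question.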

Generally, I would suggest compiling the crucial routines of your code with different compilers and different compile options (sometimes trying -ffast-math can show you ways to simplify your expressions) and having a look at the generated code. For portability, you could then re-translate the generated code into intrinsics (or leave it as is, if every compiler you care about gives good-enough results).
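As an illustration of what such a re-translation might look like, here is a sketch of the max-based form written with 256-bit AVX intrinsics: a packed multiply followed by a packed maximum. This is one plausible translation, not a transcript of any particular compiler's output. The names `relu_avx` and `relu_avx_impl` are hypothetical, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions on x86, used here only so that the code builds without -mavx and falls back to the scalar loop on CPUs without AVX.

```cpp
#include <immintrin.h>
#include <cassert>

// AVX kernel: four doubles per iteration, scalar loop for the remainder.
__attribute__((target("avx")))
static void relu_avx_impl(double* __restrict Amem, const double* Z_curr, int bo)
{
    const __m256d slope = _mm256_set1_pd(0.1);
    int i = 0;
    for (; i + 4 <= bo; i += 4) {
        __m256d z = _mm256_loadu_pd(Z_curr + i);
        __m256d scaled = _mm256_mul_pd(z, slope);
        // max(0.1*z, z): z for positive lanes, 0.1*z for negative lanes
        _mm256_storeu_pd(Amem + i, _mm256_max_pd(scaled, z));
    }
    for (; i < bo; ++i) {
        Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i] * 0.1;
    }
}

// Runtime dispatch: use the AVX kernel when the CPU supports it.
void relu_avx(double* __restrict Amem, const double* Z_curr, int bo)
{
    assert(bo >= 0);
    if (__builtin_cpu_supports("avx")) {
        relu_avx_impl(Amem, Z_curr, bo);
        return;
    }
    for (int i = 0; i < bo; ++i) {
        Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i] * 0.1;
    }
}
```

If the whole translation unit is already compiled with -mavx or -march=native, the attribute and the dispatch wrapper can be dropped and the kernel called directly.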

For completeness, here is a possible manually vectorized implementation using intrinsics:

void relu_avx512(double* __restrict Amem, double* Z_curr, int bo)
{
    int i;
    for (i=0; i<=bo-8; i+=8)
    {
        __m512d z = _mm512_loadu_pd(Z_curr+i);
        // select the non-positive lanes; only those are scaled by 0.1
        __mmask8 nonpositive = _mm512_cmple_pd_mask (z, _mm512_setzero_pd());
        __m512d res = _mm512_mask_mul_pd(z, nonpositive, z, _mm512_set1_pd(0.1));
        _mm512_storeu_pd(Amem+i, res);
    }
    // remaining elements in scalar loop
    for (; i<bo; ++i)
    {
        Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;
    }
}


Godbolt: https://godbolt.org/z/s6br5KEEc (if you compile this with -O2 or -O3 on clang, it will heavily unroll the cleanup loop, even though it can't have more than 7 iterations). Theoretically, you could handle the remaining elements with a masked or overlapping store, or, if your use case guarantees that the size is a multiple of 8, omit the cleanup loop entirely.
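The masked-store idea could be sketched as follows, assuming AVX-512F is available at run time. The names `relu_avx512_masked` and `relu_avx512_impl` are hypothetical, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions used only to keep the sketch buildable and runnable on machines without AVX-512. The tail mask has one bit set per remaining element, so the final load and store touch only valid memory.

```cpp
#include <immintrin.h>
#include <cassert>

__attribute__((target("avx512f")))
static void relu_avx512_impl(double* __restrict Amem, const double* Z_curr, int bo)
{
    const __m512d zero = _mm512_setzero_pd();
    const __m512d slope = _mm512_set1_pd(0.1);
    int i = 0;
    for (; i + 8 <= bo; i += 8) {
        __m512d z = _mm512_loadu_pd(Z_curr + i);
        // lanes with z <= 0 are multiplied by 0.1, the others pass through
        __mmask8 nonpositive = _mm512_cmp_pd_mask(z, zero, _CMP_LE_OQ);
        _mm512_storeu_pd(Amem + i, _mm512_mask_mul_pd(z, nonpositive, z, slope));
    }
    int rem = bo - i;                       // 0..7 elements left
    if (rem > 0) {
        __mmask8 tail = (__mmask8)((1u << rem) - 1u);
        // masked-out lanes are not read from memory and are set to zero
        __m512d z = _mm512_maskz_loadu_pd(tail, Z_curr + i);
        __mmask8 nonpositive = _mm512_cmp_pd_mask(z, zero, _CMP_LE_OQ);
        __m512d res = _mm512_mask_mul_pd(z, nonpositive, z, slope);
        // only the lanes selected by tail are written
        _mm512_mask_storeu_pd(Amem + i, tail, res);
    }
}

// Runtime dispatch with a scalar fallback for CPUs without AVX-512F.
void relu_avx512_masked(double* __restrict Amem, const double* Z_curr, int bo)
{
    assert(bo >= 0);
    if (__builtin_cpu_supports("avx512f")) {
        relu_avx512_impl(Amem, Z_curr, bo);
        return;
    }
    for (int i = 0; i < bo; ++i) {
        Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i] * 0.1;
    }
}
```

Compared with a scalar cleanup loop, this leaves the compiler nothing to unroll; whether it is measurably faster for a tail of at most 7 elements would need benchmarking.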
