OMP SIMD logical AND on unsigned long long

Question

I have been playing around with SIMD OMP instructions and I am not getting the compiler to emit ANDPS in my scenario.

What I'm trying to do:

This is an implementation of this problem (tldr: find pair of users with a common friend). My approach is to pack 64 bits (whether somebody is a friend or not) into an unsigned long long.
My SIMD approach: Take AND between two vectors of relationship and reduce with a OR which nicely fits the reduction pattern of OMP.

g++ instructions (on a 2019 intel i-7 macbookPro):

g++-11 friends.cpp -S -O3 -fopenmp -fsanitize=address -Wshadow -Wall -march=native --std=c++17;

My implementation below


#include 

#include 

#include "iostream"

#include 

#include 

typedef long long ll;
typedef unsigned long long ull;

using namespace std;


ull find_sol(vector> & input_data, int q) {

    bool not_friend = false;
    ull cnt = 0;
    int size_arr = (int) input_data[0].size();

    for (int i = 0; i < q; ++i) // from these friends
    {
        for (int j = i+1; j < q; ++j) // to these friends
        {
            int step = j/64;
            int remainder = j - 64*step;
        
            not_friend = (input_data[i].at(step) >> remainder) % 2 == 0;
            
            if(not_friend){   
                bool counter = false;
                
                vector & v1 = input_data[i];
                vector & v2 = input_data[j];

                #pragma omp simd reduction(|:counter)
                for (int c = 0; c < size_arr; ++c)
                {
                    __asm__ ("entry");
                    counter |= (v1[c] & v2[c])>0;
                    __asm__ ("exit");
                }

                if(counter>0)
                    cnt++;
            }
        }
    }
    return cnt << 1;
}


int main(){

    int q;
    cin >> q;
    vector> input_data(q,vector(1 + q/64,0ULL));

    for (int i = 0; i < q; ++i)
    {

        string s;
        cin >> s;

        for (int j = 0; j < 1 + q/64; ++j)
        {
            string str = s.substr(j*64,64);
            reverse(str.begin(),str.end());
            ull ul = std::stoull(str,nullptr,2);
            input_data.at(i).at(j) = ul;
        }
        
    }

    cout << find_sol(input_data,q) << endl;


    }

Looking at the assembly inside the loop, I would expect some SIMD instructions (specifically andps) but I can't see them. What's preventing my compiler to emit them? Also, is there a way for the compiler to emit a warning re:what's wrong (would be very helpful)?


    entry
# 0 "" 2
    cmpb    $0, (%rbx)
    jne L53
    movq    (%r8), %rdx
    leaq    0(,%rax,8), %rdi
    addq    %rdi, %rdx
    movq    %rdx, %r15
    shrq    $3, %r15
    cmpb    $0, (%r15,%rcx)
    jne L54
    cmpb    $0, (%r11)
    movq    (%rdx), %rdx
    jne L55
    addq    (%r9), %rdi
    movq    %rdi, %r15
    shrq    $3, %r15
    cmpb    $0, (%r15,%rcx)
    jne L56
    andq    (%rdi), %rdx
    movzbl  (%r12), %edx
    setne   %dil
    cmpb    %r13b, %dl
    jg  L21
    testb   %dl, %dl
    jne L57
L21:
    orb %dil, -32(%r10)

EDIT 1: Following Peter 1st and 2nd suggestion, I moved the marker out of the loop and I replaced the binarization by a simple OR. I'm still not getting SIMD instructions though:

           ull counter = 0;
                
                vector & v1 = input_data[i];
                vector & v2 = input_data[j];


                __asm__ ("entry" :::);

                #pragma omp simd reduction(|:counter)
                for (int c = 0; c < size_arr; ++c)
                {
                    counter |= v1[c] & v2[c];
                }
                __asm__ ("exit" :::);

                if(counter!=0)
                    cnt++;

Peter Cordes · Accepted Answer

First problem: asm. In recent GCC, non-empty Basic Asm statements like __asm__ ("entry"); have an implicit ::: "memory" clobber, making it impossible for the compiler to combine array accesses across iterations. Maybe try __asm__ ("entry" :::); if you really want these markers. (Extended asm without a memory clobber).

Or better, use better tools for looking at compiler output, such as the Godbolt compiler explorer (https://godbolt.org/) which lets you right click on a source line and go to the corresponding asm. (Optimization can make this a bit wonky, so sometimes you have to find the asm and mouseover it to make sure it comes from that source line.)

See How to remove "noise" from GCC/clang assembly output?

Second problem: -fsanitize=address makes it harder for the compiler to optimize. I only looked at GCC output without that option.

Vectorizing the OR reduction

After fixing those showstoppers:

You're forcing the compiler to booleanize to an 8-bit bool inside the inner loop, instead of just reducing the integer AND results with |= into a variable of the same type. (Which you check once after the loop.) This is probably part of why GCC has a hard time; it often makes a mess with different-sized integer types when it vectorizes at all.

(v1[c] & v2[c]) > 0; would need SSE4.1 pcmpeqqvs. just SIMD OR in the loop and check counter for !=0 after the loop. (You had bool counter, which was really surprising given counter>0 as a semantically weird way to check an unsigned value for non-zero. Even more unexpected for a bool.)

After changing that, GCC auto-vectorizes the way I expected without OpenMP, if you use -O3 (which includes -ftree-vectorize). It of course uses with vpand, not vandps, since FP booleans have lower throughput on some CPUs. (You didn't say what -march=native is for you; if you only had AVX1, e.g. on Sandybridge, then vandps is plausible.)

                ull counter = 0;
                // #pragma omp simd reduction(|:counter)
                for (int c = 0; c < size_arr; ++c)
                {
                    //__asm__ ("entry");
                    counter |= (v1[c] & v2[c]);
                    //__asm__ ("exit");
                }

                if(counter != 0)
                    cnt++;

From the Godbolt compiler explorer (which you should use instead of littering your code with asm statements)

# g++ 11.2 -O3 -march=skylake   **without** OpenMP
.L7:                              # the vector part of the inner-most loop
        vmovdqu ymm2, YMMWORD PTR [rsi+rax]
        vpand   ymm0, ymm2, YMMWORD PTR [rcx+rax]
        add     rax, 32
        vpor    ymm1, ymm1, ymm0
        cmp     rax, r8
        jne     .L7
        vextracti128    xmm0, ymm1, 0x1
        vpor    xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 8
        ...  (horizontal OR reduction of that one SIMD vector, eventually vmovq to RAX)

GCC OpenMP does vectorize, but badly / weirdly

With OpenMP, there is a vectorized version of the loop, but it sucks a lot, doing shuffles and gather loads, and storing results into a local buffer which it later reads. I don't know OpenMP that well, but unless you're using it wrong, this is a major missed optimization. Possibly it's scaling a loop counter with multiplies instead of incrementing a pointer, which is just horrible.

(Godbolt)

# g++ 11.2 -Wall -O3 -fopenmp -march=skylake -std=gnu++17
                     # with the #pragma uncommented
.L10:
        vmovdqa ymm0, ymm3
        vpermq  ymm0, ymm0, 216
        vpshufd ymm1, ymm0, 80      # unpack for 32x32 => 64-bit multiplies?
        vpmuldq ymm1, ymm1, ymm4
        vpshufd ymm0, ymm0, 250
        vpmuldq ymm0, ymm0, ymm4
        vmovdqa ymm7, ymm6           # ymm6 = set1(-1) outside the loop, gather mask
        add     rsi, 64
        vpaddq  ymm1, ymm1, ymm5
        vpgatherqq      ymm2, QWORD PTR [0+ymm1*1], ymm7
        vpaddq  ymm0, ymm0, ymm5
        vmovdqa ymm7, ymm6
        vpgatherqq      ymm1, QWORD PTR [0+ymm0*1], ymm7
        vpand   ymm0, ymm1, YMMWORD PTR [rsi-32]      # memory source = one array
        vpand   ymm1, ymm2, YMMWORD PTR [rsi-64]

        vpor    ymm0, ymm0, YMMWORD PTR [rsp+64]     # OR with old contents of local buffer
        vpor    ymm1, ymm1, YMMWORD PTR [rsp+32]
        vpaddd  ymm3, ymm3, ymm4
        vmovdqa YMMWORD PTR [rsp+32], ymm1           # and store back into it.
        vmovdqa YMMWORD PTR [rsp+64], ymm0
        cmp     r9, rsi
        jne     .L10

        mov     edi, DWORD PTR [rsp+16]       # outer loop tail
        cmp     DWORD PTR [rsp+20], edi
        je      .L7

This buffer of 64 bytes is read at the top of .L7 (an outer loop)

.L7:
        vmovdqa ymm2, YMMWORD PTR [rsp+32]
        vpor    ymm1, ymm2, YMMWORD PTR [rsp+64]
        vextracti128    xmm0, ymm1, 0x1
        vpor    xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 8
        vpor    xmm0, xmm0, xmm1
        vmovq   rsi, xmm0

        cmp     rsi, 1                   # sets CF unless RSI=0
        sbb     r13, -1                  # R13 -= -1 +CF    i.e. increment if CF=0

IDK if there's a way to hand-hold the compiler into making better asm; perhaps with pointer-width loop counters?

GCC5.4 -O3 -fopenmp -march=haswell -std=gnu++17 makes sane asm, with just vpand / vpor and an array index increment in the loop. The stuff outside the loop is a bit different with OpenMP vs. plain vectorization, with OpenMP using vector store / scalar reload for the horizontal OR reduction of the final vector.

OMP SIMD logical AND on unsigned long long

Answers (1)

Vectorizing the OR reduction

GCC OpenMP does vectorize, but badly / weirdly

Related Questions