Guillaume

Reputation: 1286

OMP SIMD logical AND on unsigned long long

I have been playing around with SIMD OMP instructions and I am not getting the compiler to emit ANDPS in my scenario.

What I'm trying to do:

g++ invocation (on a 2019 Intel i7 MacBook Pro):

g++-11 friends.cpp -S -O3 -fopenmp -fsanitize=address -Wshadow -Wall -march=native --std=c++17;

My implementation is below:


#include <vector>
#include <algorithm>
#include <iostream>
#include <string>     // for std::string / std::stoull
#include <cmath>
#include <numeric>

typedef long long ll;
typedef unsigned long long ull;

using namespace std;


ull find_sol(vector<vector<ull>> & input_data, int q) {

    bool not_friend = false;
    ull cnt = 0;
    int size_arr = (int) input_data[0].size();

    for (int i = 0; i < q; ++i) // from these friends
    {
        for (int j = i+1; j < q; ++j) // to these friends
        {
            int step = j/64;
            int remainder = j - 64*step;
        
            not_friend = (input_data[i].at(step) >> remainder) % 2 == 0;
            
            if(not_friend){   
                bool counter = false;
                
                vector<ull> & v1 = input_data[i];
                vector<ull> & v2 = input_data[j];

                #pragma omp simd reduction(|:counter)
                for (int c = 0; c < size_arr; ++c)
                {
                    __asm__ ("entry");
                    counter |= (v1[c] & v2[c])>0;
                    __asm__ ("exit");
                }

                if(counter>0)
                    cnt++;
            }
        }
    }
    return cnt << 1;
}


int main(){

    int q;
    cin >> q;
    vector<vector<ull>> input_data(q,vector<ull>(1 + q/64,0ULL));

    for (int i = 0; i < q; ++i)
    {

        string s;
        cin >> s;

        for (int j = 0; j < 1 + q/64; ++j)
        {
            string str = s.substr(j*64,64);
            reverse(str.begin(),str.end());
            ull ul = std::stoull(str,nullptr,2);
            input_data.at(i).at(j) = ul;
        }
        
    }

    cout << find_sol(input_data,q) << endl;


}

Looking at the assembly inside the loop, I would expect some SIMD instructions (specifically andps) but I can't see any. What's preventing my compiler from emitting them? Also, is there a way to get the compiler to emit a warning about what's wrong? (That would be very helpful.)


    entry
# 0 "" 2
    cmpb    $0, (%rbx)
    jne L53
    movq    (%r8), %rdx
    leaq    0(,%rax,8), %rdi
    addq    %rdi, %rdx
    movq    %rdx, %r15
    shrq    $3, %r15
    cmpb    $0, (%r15,%rcx)
    jne L54
    cmpb    $0, (%r11)
    movq    (%rdx), %rdx
    jne L55
    addq    (%r9), %rdi
    movq    %rdi, %r15
    shrq    $3, %r15
    cmpb    $0, (%r15,%rcx)
    jne L56
    andq    (%rdi), %rdx
    movzbl  (%r12), %edx
    setne   %dil
    cmpb    %r13b, %dl
    jg  L21
    testb   %dl, %dl
    jne L57
L21:
    orb %dil, -32(%r10)

EDIT 1: Following Peter's 1st and 2nd suggestions, I moved the markers out of the loop and replaced the booleanization with a simple OR. I'm still not getting SIMD instructions though:

                ull counter = 0;
                
                vector<ull> & v1 = input_data[i];
                vector<ull> & v2 = input_data[j];


                __asm__ ("entry" :::);

                #pragma omp simd reduction(|:counter)
                for (int c = 0; c < size_arr; ++c)
                {
                    counter |= v1[c] & v2[c];
                }
                __asm__ ("exit" :::);

                if(counter!=0)
                    cnt++;

Upvotes: 0

Views: 287

Answers (1)

Peter Cordes

Reputation: 364308

First problem: asm. In recent GCC, non-empty Basic Asm statements like __asm__ ("entry"); have an implicit ::: "memory" clobber, making it impossible for the compiler to combine array accesses across iterations. Maybe try __asm__ ("entry" :::); if you really want these markers. (Extended asm without a memory clobber).
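A minimal sketch of the difference, using empty asm bodies so the file still assembles (the original "entry"/"exit" strings only survive with -S; the macro names here are my own, not from the question):

```cpp
#include <cstdint>

// Basic Asm with a non-empty body gets an implicit "memory" clobber in
// recent GCC, acting as an optimization barrier. Extended Asm with an
// empty clobber list does not, so the compiler can still vectorize
// loads/stores across the marker.
#define MARKER_OPAQUE()      __asm__ volatile ("" ::: "memory")  // barrier
#define MARKER_TRANSPARENT() __asm__ ("" :::)                    // no barrier

uint64_t or_reduce(const uint64_t* a, const uint64_t* b, int n) {
    uint64_t acc = 0;
    MARKER_TRANSPARENT();   // vectorization of the loop is still possible
    for (int c = 0; c < n; ++c)
        acc |= a[c] & b[c];
    MARKER_TRANSPARENT();
    return acc;
}
```

With MARKER_OPAQUE() placed *inside* the loop body, as in the question, every iteration has to reload and store through memory, defeating any chance of combining iterations.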

Or better, use better tools for looking at compiler output, such as the Godbolt compiler explorer (https://godbolt.org/) which lets you right click on a source line and go to the corresponding asm. (Optimization can make this a bit wonky, so sometimes you have to find the asm and mouseover it to make sure it comes from that source line.)

See How to remove "noise" from GCC/clang assembly output?

Second problem: -fsanitize=address makes it harder for the compiler to optimize. I only looked at GCC output without that option.


Vectorizing the OR reduction

After fixing those showstoppers:

You're forcing the compiler to booleanize to an 8-bit bool inside the inner loop, instead of just reducing the integer AND results with |= into a variable of the same type. (Which you check once after the loop.) This is probably part of why GCC has a hard time; it often makes a mess with different-sized integer types when it vectorizes at all.

(v1[c] & v2[c]) > 0 would need SSE4.1 pcmpeqq, vs. just SIMD OR in the loop and a single counter != 0 check after the loop. (You had bool counter, which was really surprising given counter>0, a semantically weird way to check an unsigned value for non-zero. Even more unexpected for a bool.)
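To make the two formulations concrete, here is a side-by-side sketch (function names are mine, for illustration): the first is what the question does, booleanizing each element; the second keeps everything as 64-bit integers and tests once at the end. They return the same answer, but only the second is friendly to auto-vectorization (just AND + OR on whole vectors).

```cpp
#include <cstdint>
#include <cstddef>

// Question's version: per-element booleanize, bool |= reduction.
// Mixing 8-bit bool with 64-bit elements makes GCC's vectorizer struggle.
bool intersects_bool(const uint64_t* v1, const uint64_t* v2, size_t n) {
    bool counter = false;
    for (size_t c = 0; c < n; ++c)
        counter |= (v1[c] & v2[c]) > 0;   // forces a compare per element
    return counter;
}

// Recommended version: plain integer AND/OR in the loop, one test after.
bool intersects_ull(const uint64_t* v1, const uint64_t* v2, size_t n) {
    uint64_t counter = 0;
    for (size_t c = 0; c < n; ++c)
        counter |= v1[c] & v2[c];         // vectorizes to vpand/vpor
    return counter != 0;
}
```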

After changing that, GCC auto-vectorizes the way I expected without OpenMP, if you use -O3 (which includes -ftree-vectorize). It of course uses vpand, not vandps, since FP booleans have lower throughput on some CPUs. (You didn't say what -march=native is for you; if you only had AVX1, e.g. on Sandybridge, then vandps is plausible.)

                ull counter = 0;
                // #pragma omp simd reduction(|:counter)
                for (int c = 0; c < size_arr; ++c)
                {
                    //__asm__ ("entry");
                    counter |= (v1[c] & v2[c]);
                    //__asm__ ("exit");
                }

                if(counter != 0)
                    cnt++;

From the Godbolt compiler explorer (which you should use instead of littering your code with asm statements)

# g++ 11.2 -O3 -march=skylake   **without** OpenMP
.L7:                              # the vector part of the inner-most loop
        vmovdqu ymm2, YMMWORD PTR [rsi+rax]
        vpand   ymm0, ymm2, YMMWORD PTR [rcx+rax]
        add     rax, 32
        vpor    ymm1, ymm1, ymm0
        cmp     rax, r8
        jne     .L7
        vextracti128    xmm0, ymm1, 0x1
        vpor    xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 8
        ...  (horizontal OR reduction of that one SIMD vector, eventually vmovq to RAX)

GCC OpenMP does vectorize, but badly / weirdly

With OpenMP, there is a vectorized version of the loop, but it sucks a lot, doing shuffles and gather loads, and storing results into a local buffer which it later reads. I don't know OpenMP that well, but unless you're using it wrong, this is a major missed optimization. Possibly it's scaling a loop counter with multiplies instead of incrementing a pointer, which is just horrible.

(Godbolt)

# g++ 11.2 -Wall -O3 -fopenmp -march=skylake -std=gnu++17
                     # with the #pragma uncommented
.L10:
        vmovdqa ymm0, ymm3
        vpermq  ymm0, ymm0, 216
        vpshufd ymm1, ymm0, 80      # unpack for 32x32 => 64-bit multiplies?
        vpmuldq ymm1, ymm1, ymm4
        vpshufd ymm0, ymm0, 250
        vpmuldq ymm0, ymm0, ymm4
        vmovdqa ymm7, ymm6           # ymm6 = set1(-1) outside the loop, gather mask
        add     rsi, 64
        vpaddq  ymm1, ymm1, ymm5
        vpgatherqq      ymm2, QWORD PTR [0+ymm1*1], ymm7
        vpaddq  ymm0, ymm0, ymm5
        vmovdqa ymm7, ymm6
        vpgatherqq      ymm1, QWORD PTR [0+ymm0*1], ymm7
        vpand   ymm0, ymm1, YMMWORD PTR [rsi-32]      # memory source = one array
        vpand   ymm1, ymm2, YMMWORD PTR [rsi-64]

        vpor    ymm0, ymm0, YMMWORD PTR [rsp+64]     # OR with old contents of local buffer
        vpor    ymm1, ymm1, YMMWORD PTR [rsp+32]
        vpaddd  ymm3, ymm3, ymm4
        vmovdqa YMMWORD PTR [rsp+32], ymm1           # and store back into it.
        vmovdqa YMMWORD PTR [rsp+64], ymm0
        cmp     r9, rsi
        jne     .L10

        mov     edi, DWORD PTR [rsp+16]       # outer loop tail
        cmp     DWORD PTR [rsp+20], edi
        je      .L7

This 64-byte buffer is read back at the top of .L7 (an outer loop):

.L7:
        vmovdqa ymm2, YMMWORD PTR [rsp+32]
        vpor    ymm1, ymm2, YMMWORD PTR [rsp+64]
        vextracti128    xmm0, ymm1, 0x1
        vpor    xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 8
        vpor    xmm0, xmm0, xmm1
        vmovq   rsi, xmm0

        cmp     rsi, 1                   # sets CF unless RSI=0
        sbb     r13, -1                  # R13 -= -1 +CF    i.e. increment if CF=0
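In case the cmp/sbb idiom is unfamiliar, here is a scalar sketch of what it computes (the function name is mine): cmp rsi, 1 sets CF exactly when rsi == 0, and sbb r13, -1 does r13 = r13 - (-1) - CF, i.e. a branchless "increment unless counter was zero".

```cpp
#include <cstdint>

// Branchless increment-if-nonzero, mirroring the cmp/sbb pair above.
uint64_t sbb_increment(uint64_t cnt, uint64_t counter) {
    unsigned cf = (counter < 1);          // CF after: cmp counter, 1
    return cnt - (uint64_t)-1 - cf;       // sbb cnt, -1  =>  cnt + 1 - CF
}
```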

IDK if there's a way to hand-hold the compiler into making better asm; perhaps with pointer-width loop counters?
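One hedged guess at that hand-holding, untested against GCC 11's OpenMP codegen: use a pointer-width (size_t) loop counter and raw pointers, so indexing is a simple increment rather than a re-scaled 32-bit counter. Whether this actually avoids the gather/multiply mess is speculation.

```cpp
#include <cstdint>
#include <cstddef>

// Same OR reduction, but with a size_t counter and raw pointers.
// (Without -fopenmp the pragma is simply ignored and the loop still
// auto-vectorizes at -O3.)
uint64_t or_reduce_ptrwidth(const uint64_t* v1, const uint64_t* v2, size_t n) {
    uint64_t counter = 0;
    #pragma omp simd reduction(|:counter)
    for (size_t c = 0; c < n; ++c)
        counter |= v1[c] & v2[c];
    return counter;
}
```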

GCC5.4 -O3 -fopenmp -march=haswell -std=gnu++17 makes sane asm, with just vpand / vpor and an array index increment in the loop. The stuff outside the loop is a bit different with OpenMP vs. plain vectorization, with OpenMP using vector store / scalar reload for the horizontal OR reduction of the final vector.

Upvotes: 3
