Reputation: 1286
I have been playing around with SIMD OMP instructions and I am not getting the compiler to emit ANDPS
in my scenario.
What I'm trying to do:
unsigned long long
.AND
between two vectors of relationship and reduce
with a OR
which nicely fits the reduction pattern of OMP.g++ instructions (on a 2019 intel i-7 macbookPro):
g++-11 friends.cpp -S -O3 -fopenmp -fsanitize=address -Wshadow -Wall -march=native --std=c++17;
My implementation below
#include <vector>
#include <algorithm>
#include "iostream"
#include <cmath>
#include <numeric>
typedef long long ll;
typedef unsigned long long ull;
using namespace std;
ull find_sol(vector<vector<ull>> & input_data, int q) {
bool not_friend = false;
ull cnt = 0;
int size_arr = (int) input_data[0].size();
for (int i = 0; i < q; ++i) // from these friends
{
for (int j = i+1; j < q; ++j) // to these friends
{
int step = j/64;
int remainder = j - 64*step;
not_friend = (input_data[i].at(step) >> remainder) % 2 == 0;
if(not_friend){
bool counter = false;
vector<ull> & v1 = input_data[i];
vector<ull> & v2 = input_data[j];
#pragma omp simd reduction(|:counter)
for (int c = 0; c < size_arr; ++c)
{
__asm__ ("entry");
counter |= (v1[c] & v2[c])>0;
__asm__ ("exit");
}
if(counter>0)
cnt++;
}
}
}
return cnt << 1;
}
int main(){
int q;
cin >> q;
vector<vector<ull>> input_data(q,vector<ull>(1 + q/64,0ULL));
for (int i = 0; i < q; ++i)
{
string s;
cin >> s;
for (int j = 0; j < 1 + q/64; ++j)
{
string str = s.substr(j*64,64);
reverse(str.begin(),str.end());
ull ul = std::stoull(str,nullptr,2);
input_data.at(i).at(j) = ul;
}
}
cout << find_sol(input_data,q) << endl;
}
Looking at the assembly inside the loop, I would expect some SIMD instructions (specifically andps
) but I can't see them. What's preventing my compiler to emit them? Also, is there a way for the compiler to emit a warning re:what's wrong (would be very helpful)?
entry
# 0 "" 2
cmpb $0, (%rbx)
jne L53
movq (%r8), %rdx
leaq 0(,%rax,8), %rdi
addq %rdi, %rdx
movq %rdx, %r15
shrq $3, %r15
cmpb $0, (%r15,%rcx)
jne L54
cmpb $0, (%r11)
movq (%rdx), %rdx
jne L55
addq (%r9), %rdi
movq %rdi, %r15
shrq $3, %r15
cmpb $0, (%r15,%rcx)
jne L56
andq (%rdi), %rdx
movzbl (%r12), %edx
setne %dil
cmpb %r13b, %dl
jg L21
testb %dl, %dl
jne L57
L21:
orb %dil, -32(%r10)
EDIT 1:
Following Peter 1st and 2nd suggestion, I moved the marker out of the loop and I replaced the binarization by a simple OR
. I'm still not getting SIMD instructions though:
ull counter = 0;
vector<ull> & v1 = input_data[i];
vector<ull> & v2 = input_data[j];
__asm__ ("entry" :::);
#pragma omp simd reduction(|:counter)
for (int c = 0; c < size_arr; ++c)
{
counter |= v1[c] & v2[c];
}
__asm__ ("exit" :::);
if(counter!=0)
cnt++;
Upvotes: 0
Views: 287
Reputation: 364308
First problem: asm
. In recent GCC, non-empty Basic Asm statements like __asm__ ("entry");
have an implicit ::: "memory"
clobber, making it impossible for the compiler to combine array accesses across iterations. Maybe try __asm__ ("entry" :::);
if you really want these markers. (Extended asm without a memory clobber).
Or better, use better tools for looking at compiler output, such as the Godbolt compiler explorer (https://godbolt.org/) which lets you right click on a source line and go to the corresponding asm. (Optimization can make this a bit wonky, so sometimes you have to find the asm and mouseover it to make sure it comes from that source line.)
See How to remove "noise" from GCC/clang assembly output?
Second problem: -fsanitize=address
makes it harder for the compiler to optimize. I only looked at GCC output without that option.
After fixing those showstoppers:
You're forcing the compiler to booleanize to an 8-bit bool
inside the inner loop, instead of just reducing the integer AND results with |=
into a variable of the same type. (Which you check once after the loop.) This is probably part of why GCC has a hard time; it often makes a mess with different-sized integer types when it vectorizes at all.
(v1[c] & v2[c]) > 0;
would need SSE4.1 pcmpeqq
vs. just SIMD OR in the loop and check counter
for !=0
after the loop. (You had bool counter
, which was really surprising given counter>0
as a semantically weird way to check an unsigned value for non-zero. Even more unexpected for a bool
.)
After changing that, GCC auto-vectorizes the way I expected without OpenMP, if you use -O3
(which includes -ftree-vectorize
). It of course uses with vpand
, not vandps
, since FP booleans have lower throughput on some CPUs. (You didn't say what -march=native
is for you; if you only had AVX1, e.g. on Sandybridge, then vandps
is plausible.)
ull counter = 0;
// #pragma omp simd reduction(|:counter)
for (int c = 0; c < size_arr; ++c)
{
//__asm__ ("entry");
counter |= (v1[c] & v2[c]);
//__asm__ ("exit");
}
if(counter != 0)
cnt++;
From the Godbolt compiler explorer (which you should use instead of littering your code with asm
statements)
# g++ 11.2 -O3 -march=skylake **without** OpenMP
.L7: # the vector part of the inner-most loop
vmovdqu ymm2, YMMWORD PTR [rsi+rax]
vpand ymm0, ymm2, YMMWORD PTR [rcx+rax]
add rax, 32
vpor ymm1, ymm1, ymm0
cmp rax, r8
jne .L7
vextracti128 xmm0, ymm1, 0x1
vpor xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 8
... (horizontal OR reduction of that one SIMD vector, eventually vmovq to RAX)
With OpenMP, there is a vectorized version of the loop, but it sucks a lot, doing shuffles and gather loads, and storing results into a local buffer which it later reads. I don't know OpenMP that well, but unless you're using it wrong, this is a major missed optimization. Possibly it's scaling a loop counter with multiplies instead of incrementing a pointer, which is just horrible.
(Godbolt)
# g++ 11.2 -Wall -O3 -fopenmp -march=skylake -std=gnu++17
# with the #pragma uncommented
.L10:
vmovdqa ymm0, ymm3
vpermq ymm0, ymm0, 216
vpshufd ymm1, ymm0, 80 # unpack for 32x32 => 64-bit multiplies?
vpmuldq ymm1, ymm1, ymm4
vpshufd ymm0, ymm0, 250
vpmuldq ymm0, ymm0, ymm4
vmovdqa ymm7, ymm6 # ymm6 = set1(-1) outside the loop, gather mask
add rsi, 64
vpaddq ymm1, ymm1, ymm5
vpgatherqq ymm2, QWORD PTR [0+ymm1*1], ymm7
vpaddq ymm0, ymm0, ymm5
vmovdqa ymm7, ymm6
vpgatherqq ymm1, QWORD PTR [0+ymm0*1], ymm7
vpand ymm0, ymm1, YMMWORD PTR [rsi-32] # memory source = one array
vpand ymm1, ymm2, YMMWORD PTR [rsi-64]
vpor ymm0, ymm0, YMMWORD PTR [rsp+64] # OR with old contents of local buffer
vpor ymm1, ymm1, YMMWORD PTR [rsp+32]
vpaddd ymm3, ymm3, ymm4
vmovdqa YMMWORD PTR [rsp+32], ymm1 # and store back into it.
vmovdqa YMMWORD PTR [rsp+64], ymm0
cmp r9, rsi
jne .L10
mov edi, DWORD PTR [rsp+16] # outer loop tail
cmp DWORD PTR [rsp+20], edi
je .L7
This buffer of 64 bytes is read at the top of .L7
(an outer loop)
.L7:
vmovdqa ymm2, YMMWORD PTR [rsp+32]
vpor ymm1, ymm2, YMMWORD PTR [rsp+64]
vextracti128 xmm0, ymm1, 0x1
vpor xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 8
vpor xmm0, xmm0, xmm1
vmovq rsi, xmm0
cmp rsi, 1 # sets CF unless RSI=0
sbb r13, -1 # R13 -= -1 +CF i.e. increment if CF=0
IDK if there's a way to hand-hold the compiler into making better asm; perhaps with pointer-width loop counters?
GCC5.4 -O3 -fopenmp -march=haswell -std=gnu++17
makes sane asm, with just vpand
/ vpor
and an array index increment in the loop. The stuff outside the loop is a bit different with OpenMP vs. plain vectorization, with OpenMP using vector store / scalar reload for the horizontal OR reduction of the final vector.
Upvotes: 3