Reputation: 1680
I am trying to convert a function from an implementation using intrinsics into standard C++ (to simplify maintenance, portability, etc.). Everything worked fine, except for a loop with stride 2 where bytes at even positions are gathered into one location and bytes at odd positions are gathered into another location.
Related questions have been addressed using OpenCL or intrinsics, but I would like to stick to standard C++.
A minimal example of what I am trying to auto-vectorize would be something like this:
void f(const unsigned char *input, const unsigned size, unsigned char *output) {
    constexpr unsigned MAX_SIZE = 2000;
    unsigned char odd[MAX_SIZE / 2];
    unsigned char even[MAX_SIZE / 2];
    for (unsigned i = 0; size > i; ++i) {
        if (0 == i % 2) { even[i / 2] = input[i]; }
        else            { odd[i / 2]  = input[i]; }
    }
    //for (unsigned i = 0; size > i; i += 2) {
    //    even[i / 2] = input[i];
    //    odd[i / 2]  = input[i + 1];
    //}
    for (unsigned i = 0; size / 2 > i; ++i) {
        output[i] = (even[i] << 4) | odd[i];
    }
}
Compiling with g++-11.2, the output of -fopt-info-vec-missed is:
minimal.cpp:6:29: missed: couldn't vectorize loop
minimal.cpp:6:29: missed: not vectorized: control flow in loop.
If I change the implementation to the one that is commented out in the code, g++ fails to vectorize because:
minimal.cpp:11:29: missed: couldn't vectorize loop
minimal.cpp:13:24: missed: not vectorized: not suitable for gather load _13 = *_11;
Considering that it is straightforward to implement this with packed-shuffle-bytes (pshufb) instructions, I am surprised that g++ can't do it.
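For instance, one 16-byte block can be de-interleaved with a single byte shuffle, along these lines (an illustrative sketch assuming SSSE3; this is not the code I am converting, and the names are made up):

#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8 (pshufb)

void deinterleave16(const unsigned char *in, unsigned char *even8, unsigned char *odd8) {
    const __m128i idx = _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,   // even positions -> low half
                                      1, 3, 5, 7, 9, 11, 13, 15);  // odd positions  -> high half
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i s = _mm_shuffle_epi8(v, idx);                     // one pshufb separates evens/odds
    _mm_storel_epi64((__m128i *)even8, s);                    // low 8 bytes: evens
    _mm_storel_epi64((__m128i *)odd8, _mm_srli_si128(s, 8));  // high 8 bytes: odds
}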
Is there a way to re-write the loop so that g++ would be able to vectorize it?
Upvotes: 0
Views: 542
Reputation: 364170
It seems GCC doesn't like stuff like i < size; i += 2. Instead, it liked i < size/2; i++. GCC and clang can't auto-vectorize loops whose trip count can't be determined ahead of time. Perhaps GCC has a problem with this because you used unsigned, so i += 2 could wrap back to 0 without ever hitting size, so i < size could be permanently false, i.e. the compiler can't prove your loop isn't infinite because size = UINT_MAX is possible. (Which disables some optimizations compilers like to do, although at least it's unsigned so we don't have to redo sign extension.)
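As a contrived illustration of that hazard (not code from the question):

void hazard(const unsigned char *input, unsigned size, unsigned char *sink) {
    // If size == UINT_MAX (which is odd), i only ever takes even values, so
    // i < size stays true and i += 2 wraps from 4294967294 back to 0:
    // the compiler can't rule out an infinite loop, so no trip count.
    for (unsigned i = 0; i < size; i += 2)
        *sink ^= input[i];
}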
Clang managed to vectorize anyway (poorly: https://godbolt.org/z/b4G4jojn1); possibly it realized that even[i/2] would be UB for i beyond the constant MAX_SIZE, or else it just didn't care.
The temporary arrays seem unnecessary; I think you were only using them to try to give GCC multiple simpler problems to vectorize?
// __restrict is optional; it promises the compiler input and output won't overlap.
// It still vectorizes without it, but does a runtime check for overlap.
void g(const unsigned char *__restrict input, const unsigned size, unsigned char *__restrict output)
{
    for (unsigned i = 0; size / 2 > i; i++) {
        output[i] = (input[2 * i] << 4) | input[2 * i + 1];
    }
}
Without __restrict, on overlap it falls back to a scalar loop. In the case of input = output exactly, the vector version is still safe. I didn't test or reverse-engineer the overlap check to see whether it uses the vectorized version or not in that case. (It would be C++ UB to use it with input = output with __restrict, though.)
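A minimal usage sketch (buffer sizes are just examples, matching MAX_SIZE from the question):

unsigned char in[2000], out[1000];
// ... fill in[] ...
g(in, 2000, out);    // fine: distinct buffers
// g(in, 2000, in);  // with __restrict, this aliasing call would be C++ UB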
GCC11.2 -O3 -march=haswell auto-vectorizes this fairly reasonably (Godbolt); some missed optimizations, but not as bad as with separate loops, and of course it avoids touching new stack memory. The main inner loop looks like this:
# GCC11 -O3 -march=haswell
# before loop, YMM3 = _mm256_set1_epi16(0x00FF)
.L4: # do{
vpand ymm1, ymm3, YMMWORD PTR [rcx+32+rax*2] # why not reuse the load results for both odd/even? fortunately modern CPUs have good L1d bandwidth
vpand ymm0, ymm3, YMMWORD PTR [rcx+rax*2] # evens: load input[2*(i+0..31)] and AND away the high bytes for pack
vmovdqu ymm4, YMMWORD PTR [rcx+rax*2] # load 2 vectors of input data
vmovdqu ymm5, YMMWORD PTR [rcx+32+rax*2]
vpackuswb ymm0, ymm0, ymm1 # evens: pack evens down to single bytes.
vpsrlw ymm2, ymm5, 8 # odds: shift down to line up with evens
vpsrlw ymm1, ymm4, 8
vpermq ymm0, ymm0, 216 # evens: lane-crossing fixup
vpaddb ymm0, ymm0, ymm0 # evens <<= 1 (x86 SIMD lacks a per-byte shift like vpsllb, even with AVX-512)
vpackuswb ymm1, ymm1, ymm2 # odds: pack
vpaddb ymm0, ymm0, ymm0 # evens <<= 1
vpermq ymm1, ymm1, 216 # odds: lane-crossing fixup
vpaddb ymm0, ymm0, ymm0 # evens <<= 1
vpaddb ymm0, ymm0, ymm0 # evens <<= 1
vpor ymm0, ymm0, ymm1 # (evens<<4) | odds
vmovdqu YMMWORD PTR [rdi+rax], ymm0 # store to output
add rax, 32 # advance output position by 32 bytes. (Input positions scale by 2)
cmp rdx, rax
jne .L4 # } while(i != size/2)
It would have been faster if GCC had chosen to mask with 0x000F instead of 0x00FF before packing, so the packed evens could be left-shifted with vpsllw instead of 4x vpaddb without spilling any non-zero bits into the next byte. Or just shift and AND again; that's the standard way to emulate the non-existent vpsllb.
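In intrinsics, that shift-and-AND emulation of a per-byte left shift would look something like this (a minimal AVX2 sketch; the helper name is made up):

#include <immintrin.h>

// Emulate the non-existent vpsllb for a shift count of 4: shift 16-bit lanes,
// then mask off the bits that crossed each byte boundary.
__m256i slli_epi8_by4(__m256i v) {
    __m256i shifted = _mm256_slli_epi16(v, 4);                       // vpsllw
    return _mm256_and_si256(shifted, _mm256_set1_epi8((char)0xF0));  // vpand
}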
Or even better, OR together high and low within each word before packing down to bytes.
# manually vectorized; what GCC could have done in theory
# if using intrinsics, this strategy is probably good.
vmovdqu ymm0, [mem]
vmovdqu ymm1, [mem+32]
vpsllw ymm2, ymm0, 12 # evens: line up with odds, and do the <<4
vpsllw ymm3, ymm1, 12
vpor ymm0, ymm0, ymm2 # odds |= (evens<<4) in the high byte of each word
vpor ymm1, ymm1, ymm3
vpsrlw ymm0, ymm0, 8 # shift merged to bottom of word
vpsrlw ymm1, ymm1, 8
vpackuswb ymm0, ymm0, ymm1 # and pack
vpermq ymm0, ymm0, 0xD8 # same 216 as above
vmovdqu [mem], ymm0
.. pointer increment / loop condition
Notice that we avoided an AND constant; both halves needed shifting anyway (odd to be in the right place for the pack, even because of the <<4). Shifting after packing would mean half as much data to shift, but would have needed masking after the shift, so it's break-even except for back-end port pressure on ALU ports with shift units (https://agner.org/optimize/ ; https://uops.info/). But merging before packing saves shuffles, and that's a bigger throughput bottleneck on Intel CPUs.
If we can add instead of OR (because we know there aren't overlapping bits, so it's equivalent), we could use 2x vpmaddubsw (_mm256_maddubs_epi16) with the signed (second) operand as _mm256_set1_epi16(0x0110) and the unsigned (first) input holding data from the array, to compute input[2*i+1] + (input[2*i] * 16) within each byte pair. Then AND and VPACKUSWB / VPERMQ from words down to byte elements and store.
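In intrinsics, that strategy could look like the following sketch (assuming AVX2, 64 input bytes per step, and that the even bytes fit in 4 bits so add really is equivalent to OR; function and variable names are illustrative):

#include <immintrin.h>

// Process 64 input bytes into 32 output bytes: out[i] = in[2i]*16 + in[2i+1].
void pack64(const unsigned char *in, unsigned char *out) {
    const __m256i mul  = _mm256_set1_epi16(0x0110);  // low byte of each pair * 0x10, high byte * 0x01
    const __m256i keep = _mm256_set1_epi16(0x00FF);  // keep the low byte of each word so the pack can't saturate

    __m256i v0 = _mm256_loadu_si256((const __m256i *)in);
    __m256i v1 = _mm256_loadu_si256((const __m256i *)(in + 32));
    __m256i w0 = _mm256_maddubs_epi16(v0, mul);      // vpmaddubsw: in[2i]*16 + in[2i+1] in each word
    __m256i w1 = _mm256_maddubs_epi16(v1, mul);
    w0 = _mm256_and_si256(w0, keep);                 // vpand
    w1 = _mm256_and_si256(w1, keep);
    __m256i packed = _mm256_packus_epi16(w0, w1);    // vpackuswb: words -> bytes, within 128-bit lanes
    packed = _mm256_permute4x64_epi64(packed, 0xD8); // vpermq: fix the lane interleave
    _mm256_storeu_si256((__m256i *)out, packed);
}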
Upvotes: 3
Reputation: 191
Oh, I found @Peter Cordes's comment and combined it with my initial answer:
https://gcc.godbolt.org/z/bxzsfxPGx
and -fopt-info-vec-missed no longer reports anything:
void f(const unsigned char *input, const unsigned size, unsigned char *output) {
    constexpr unsigned MAX_SIZE = 2000;
    unsigned char odd[MAX_SIZE / 2];
    unsigned char even[MAX_SIZE / 2];
    for (unsigned i = 0, j = 0; size > i; i += 2, ++j) {
        even[j] = input[i];
        odd[j] = input[i + 1];
    }
    for (unsigned i = 0; size / 2 > i; ++i) {
        output[i] = (even[i] << 4) | odd[i];
    }
}
Upvotes: 2