Come Raczy

Reputation: 1680

How to autovectorize a loop with access stride 2 with g++ without OpenCL or intrinsics

I am trying to convert a function from an implementation using intrinsics into standard C++ (to simplify maintenance, portability, etc.). Everything worked fine, except for a loop with stride 2 where bytes at even positions are gathered into one location and bytes at odd positions are gathered into another.

Related questions have been addressed using OpenCL or intrinsics, but I would like to stick to standard C++.

A minimal example of what I am trying to auto-vectorize would be something like this:

void f(const unsigned char *input, const unsigned size, unsigned char *output) {
  constexpr unsigned MAX_SIZE = 2000;
  unsigned char odd[MAX_SIZE / 2];
  unsigned char even[MAX_SIZE / 2];
  for (unsigned i = 0; size > i; ++i) {
    if (0 == i % 2) {even[i/2] = input[i];}
    else {odd[i/2] = input[i];}
  }
  //for (unsigned i = 0; size > i; i+=2) {
  //  even[i/2] = input[i];
  //  odd[i/2] = input[i+1];
  //}
  for (unsigned i = 0; size / 2 > i; ++i)
  {
    output[i] = (even[i] << 4) | odd[i];
  }

}

Compiling with g++-11.2, the output of -fopt-info-vec-missed is:

minimal.cpp:6:29: missed: couldn't vectorize loop
minimal.cpp:6:29: missed: not vectorized: control flow in loop.

If I change the implementation to the one that is commented out in the code, g++ fails to vectorize because:

minimal.cpp:11:29: missed: couldn't vectorize loop
minimal.cpp:13:24: missed: not vectorized: not suitable for gather load _13 = *_11;

Considering that it is straightforward to implement this with packed shuffle bytes (pshufb) instructions, I am surprised that g++ can't do it.
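(For reference, a minimal sketch of the kind of shuffle I mean, per 16-byte block, with SSSE3 intrinsics; illustrative only, not the exact code I am replacing:)

    // illustrative only: split one 16-byte block into 8 even-position and 8 odd-position bytes
    #include <tmmintrin.h>

    void split16(const unsigned char *in, unsigned char *even8, unsigned char *odd8) {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(in));
        // even source indices go to the low half, odd source indices to the high half
        const __m128i demux = _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                            1, 3, 5, 7, 9, 11, 13, 15);
        const __m128i sorted = _mm_shuffle_epi8(v, demux);
        _mm_storel_epi64(reinterpret_cast<__m128i *>(even8), sorted);                    // low 8 bytes
        _mm_storel_epi64(reinterpret_cast<__m128i *>(odd8), _mm_srli_si128(sorted, 8));  // high 8 bytes
    }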

Is there a way to re-write the loop so that g++ would be able to vectorize it?

Upvotes: 0

Views: 542

Answers (2)

Peter Cordes

Reputation: 364170

It seems GCC doesn't like loop structures such as i<size ; i += 2; it is happy with i<size/2 ; i++. GCC and clang can't auto-vectorize loops whose trip-count can't be determined ahead of time. Perhaps GCC has a problem with this because you used unsigned: with i+=2, i could wrap back to 0 without ever hitting size, so i<size might never become false, i.e. the compiler can't prove your loop isn't infinite because size = UINT_MAX is possible. (Which disables some optimizations compilers like to do, although at least it's unsigned so we don't have to redo sign extension.)

Clang managed to vectorize anyway (poorly: https://godbolt.org/z/b4G4jojn1); possibly it realized that indexing even[] out of bounds would be UB, so the constant MAX_SIZE effectively bounds the trip count, or else it just didn't care.


The temporary arrays seem unnecessary; I think you were only using them to try to give GCC multiple simpler problems to vectorize?

// __restrict is optional; it promises the compiler input and output won't overlap
// it still vectorizes without it, but does a check for overlap

void g(const unsigned char *__restrict input, const unsigned size, unsigned char *__restrict output)
{
    for (unsigned i = 0 ; size/2 > i; i++) {
        output[i] = (input[2*i] << 4) | input[2*i+1];
    }
}

Without __restrict, on overlap it falls back to a scalar loop. In the case of input = output exactly, the vector version is still safe. I didn't test or reverse-engineer the overlap check to see whether it uses the vectorized version or not in that case. (It would be C++ UB to use it with input=output with __restrict, though.)

GCC11.2 -O3 -march=haswell auto-vectorizes this fairly reasonably (Godbolt); some missed optimizations but not as bad as with separate loops, and of course avoids touching new stack memory. The main inner loop looks like this:

# GCC11 -O3 -march=haswell
# before loop, YMM3 = _mm256_set1_epi16(0x00FF)
.L4:                                              # do{
        vpand   ymm1, ymm3, YMMWORD PTR [rcx+32+rax*2]  # why not reuse the load results for both odd/even?  fortunately modern CPUs have good L1d bandwidth
        vpand   ymm0, ymm3, YMMWORD PTR [rcx+rax*2]     # evens: load input[2*(i+0..31)]  and AND away the high bytes for pack
        vmovdqu ymm4, YMMWORD PTR [rcx+rax*2]       # load 2 vectors of input data
        vmovdqu ymm5, YMMWORD PTR [rcx+32+rax*2]
        vpackuswb       ymm0, ymm0, ymm1            # evens: pack evens down to single bytes.
        vpsrlw  ymm2, ymm5, 8                       # odds: shift down to line up with evens
        vpsrlw  ymm1, ymm4, 8
        vpermq  ymm0, ymm0, 216                     # evens: lane-crossing fixup
        vpaddb  ymm0, ymm0, ymm0                # evens <<= 1 (x86 SIMD has no per-byte shift vpsllb, even with AVX-512)
        vpackuswb       ymm1, ymm1, ymm2            # odds: pack
        vpaddb  ymm0, ymm0, ymm0                # evens <<= 1
        vpermq  ymm1, ymm1, 216                     # odds: lane-crossing fixup
        vpaddb  ymm0, ymm0, ymm0                # evens <<= 1
        vpaddb  ymm0, ymm0, ymm0                # evens <<= 1
        vpor    ymm0, ymm0, ymm1                    # (evens<<4) | odds
        vmovdqu YMMWORD PTR [rdi+rax], ymm0         # store to output
        add     rax, 32                             # advance output position by 32 bytes.  (Input positions scale by 2)
        cmp     rdx, rax
        jne     .L4                              # } while(i != size/2)

It would have been faster if GCC had chosen to mask with 0x000F instead of 0x00FF before packing, so the packed evens could be left-shifted with vpsllw instead of 4x vpaddb without spilling any non-zero bits into the next byte. Or just shift and AND again; that's the standard way to emulate the non-existent vpsllb.
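For illustration, that shift-and-mask idiom looks roughly like this with AVX2 intrinsics (a sketch of the idiom, not what GCC actually emits here):

    // Sketch: emulate the non-existent per-byte left shift (vpsllb) by 4
    // with a 16-bit shift plus a mask.  Illustrative, not GCC's output.
    #include <immintrin.h>

    __m256i sllb_by4(__m256i v) {
        __m256i shifted = _mm256_slli_epi16(v, 4);                       // shift 16-bit lanes left by 4
        return _mm256_and_si256(shifted, _mm256_set1_epi8((char)0xF0));  // drop bits that crossed into the next byte
    }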

Or even better, OR together high and low within each word before packing down to bytes.

# manually vectorized;  what GCC could have done in theory
# if using intrinsics, this strategy is probably good.
    vmovdqu  ymm0, [mem]
    vmovdqu  ymm1, [mem+32]
    vpsllw   ymm2, ymm0, 12          # evens: line up with odds, and do the <<4
    vpsllw   ymm3, ymm1, 12
    vpor     ymm0, ymm0, ymm2        # odds |= (evens<<4) in the high byte of each word
    vpor     ymm1, ymm1, ymm3
    vpsrlw   ymm0, ymm0, 8           # shift merged to bottom of word
    vpsrlw   ymm1, ymm1, 8
    vpackuswb ymm0, ymm0, ymm1       # and pack
    vpermq   ymm0, ymm0, 0xD8   # same 216 as above
    vmovdqu [mem], ymm0
    .. pointer increment / loop condition

Notice that we avoided an AND constant; both halves needed shifting anyway (the odds to be in the right place for the pack, the evens because of the <<4). Shifting after packing would mean half as much data to shift, but would have needed masking after the shift, so it's break-even except for back-end port pressure on ALU ports with shift units (https://agner.org/optimize/ ; https://uops.info/). But merging before packing saves shuffles, and that's a bigger throughput bottleneck on Intel CPUs.
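With intrinsics, one iteration of that merge-before-pack strategy would look roughly like this (a hand-written sketch under the same assumptions as the asm above, not compiler output; names are made up):

    // Sketch: merge (evens<<4)|odds within each word before packing, AVX2.
    // Handles one 64-byte input block -> 32 output bytes.  Illustrative only.
    #include <immintrin.h>

    void pack64(const unsigned char *in, unsigned char *out) {
        __m256i v0 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(in));
        __m256i v1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(in + 32));
        // evens sit in the low byte of each word; <<12 lines them up with the odd byte
        // and does the <<4 at the same time
        __m256i m0 = _mm256_or_si256(v0, _mm256_slli_epi16(v0, 12));
        __m256i m1 = _mm256_or_si256(v1, _mm256_slli_epi16(v1, 12));
        m0 = _mm256_srli_epi16(m0, 8);                        // merged byte down to the bottom of each word
        m1 = _mm256_srli_epi16(m1, 8);
        __m256i packed = _mm256_packus_epi16(m0, m1);         // words -> bytes
        packed = _mm256_permute4x64_epi64(packed, 0xD8);      // lane-crossing fixup after pack
        _mm256_storeu_si256(reinterpret_cast<__m256i *>(out), packed);
    }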


If we can add instead of OR (because we know there aren't overlapping bits, so it's equivalent), we could use 2x vpmaddubsw (_mm256_maddubs_epi16), with the signed (second) operand being _mm256_set1_epi16(0x0110) and the unsigned (first) operand holding data from the array, to do input[2*i+1] + (input[2*i] * 16) within each byte pair. Then AND and VPACKUSWB / VPERMQ from words down to byte elements and store.
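A rough sketch of that vpmaddubsw idea with intrinsics (assuming the odd bytes only use their low nibble, so add really is equivalent to OR):

    // Sketch: input[2*i]*16 + input[2*i+1] per byte pair via vpmaddubsw, then
    // mask, pack and permute back to bytes.  One 64-byte block; illustrative only.
    #include <immintrin.h>

    void pack64_madd(const unsigned char *in, unsigned char *out) {
        const __m256i mul  = _mm256_set1_epi16(0x0110);   // low byte of each word * 16, high byte * 1
        const __m256i mask = _mm256_set1_epi16(0x00FF);   // keep only the low byte of each word
        __m256i v0 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(in));
        __m256i v1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(in + 32));
        __m256i w0 = _mm256_and_si256(_mm256_maddubs_epi16(v0, mul), mask);
        __m256i w1 = _mm256_and_si256(_mm256_maddubs_epi16(v1, mul), mask);
        __m256i packed = _mm256_packus_epi16(w0, w1);     // words -> bytes
        packed = _mm256_permute4x64_epi64(packed, 0xD8);  // lane-crossing fixup after pack
        _mm256_storeu_si256(reinterpret_cast<__m256i *>(out), packed);
    }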

Upvotes: 3

Igor Zhukov

Reputation: 191

Oh, I found @Peter Cordes's comment and combined it with my initial answer:

https://gcc.godbolt.org/z/bxzsfxPGx

and -fopt-info-vec-missed no longer reports anything for it:

void f(const unsigned char *input, const unsigned size, unsigned char *output) {
    constexpr unsigned MAX_SIZE = 2000;
    unsigned char odd[MAX_SIZE / 2];
    unsigned char even[MAX_SIZE / 2];
    for (unsigned i = 0, j = 0; size > i; i += 2, ++j) {
        even[j] = input[i];
        odd[j] = input[i + 1];
    }

    for (unsigned i = 0; size / 2 > i; ++i) {
        output[i] = (even[i] << 4) | odd[i];
    }
}

Upvotes: 2
