nowox

Reputation: 29166

Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?

I've recently been diving deeper into x86-64 architecture and exploring the capabilities of SSE and AVX. I attempted to write a simple vector addition function like this:

void compute(const float *a, const float *b, float *c) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}

Using both gcc and clang, I compiled with the following options:

cc -std=c23 -march=native -O3 -ftree-vectorize main.c

However, when I checked the disassembly, the output wasn’t quite what I expected in terms of vectorization:

compute:
  vmovss xmm0, dword ptr [rdi]
  vaddss xmm0, xmm0, dword ptr [rsi]
  vmovss dword ptr [rdx], xmm0
  vmovss xmm0, dword ptr [rdi + 4]
  vaddss xmm0, xmm0, dword ptr [rsi + 4]
  vmovss dword ptr [rdx + 4], xmm0
  vmovss xmm0, dword ptr [rdi + 8]
  vaddss xmm0, xmm0, dword ptr [rsi + 8]
  vmovss dword ptr [rdx + 8], xmm0
  vmovss xmm0, dword ptr [rdi + 12]
  vaddss xmm0, xmm0, dword ptr [rsi + 12]
  vmovss dword ptr [rdx + 12], xmm0
  ret

This seems like scalar code, processing one element at a time. But when I manually use intrinsics, I get the expected vectorized implementation:

#include <xmmintrin.h>

void compute(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_storeu_ps(c, vc);
}
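
For what it's worth, the 256-bit equivalent (a sketch of mine, assuming AVX is available and the arrays hold at least 8 floats) also compiles down to a single vector add plus loads and stores:

#include <immintrin.h>

// 256-bit variant: adds 8 floats at once (requires AVX).
void compute8(const float *a, const float *b, float *c) {
    __m256 va = _mm256_loadu_ps(a);    // unaligned load of 8 floats
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);
    _mm256_storeu_ps(c, vc);           // unaligned store of 8 floats
}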

As I understand it, modern processors are incredibly powerful, and SSE (introduced in 1999) and AVX (since 2011) are now standard. Yet it seems compilers don't always take full advantage of these instructions automatically, even when I explicitly enable optimizations.

It feels a bit like we've invented teleportation, but people still prefer to cross the Atlantic by boat. Is there a rational reason why modern compilers might be hesitant to generate vectorized code for something as straightforward as this?


As Barmar suggested, 4 elements might not be enough to benefit from vectorization. I tried the following and got the same disappointing results:

#include <stddef.h>  // for size_t

float a[512];
float b[512];
float c[512];

void compute() {  
    for (size_t i = 0; i < 512; i++) 
        c[i] = a[i] + b[i];
}

(On Godbolt, GCC -O3 -march=x86-64-v3 does auto-vectorize this with 256-bit AVX instructions.)

Upvotes: 29

Views: 3154

Answers (2)

bazza

Reputation: 8434

It feels a bit like we've invented teleportation, but people still prefer to cross the Atlantic by boat. Is there a rational reason why modern compilers might be hesitant to generate vectorized code for something as straightforward as this?

Too Many to Choose From

One of the major problems software has had with MMX, SSE, SSE2, AVX, AVX2... AVX-512 is that there are too many of them! If one wants one's software binaries to be compatible, one uses an opcode set for the lowest common denominator (i.e. plain old x64 and its FPU instructions).

Using a Library

Intel did go some way towards "automating" the use of such extensions with their IPP / MKL libraries. These package up useful routines such as FFTs, and the library works out which implementation suits the chip the software is running on and uses the best one. As a developer, one writes single-threaded code around these libraries, and they do all the hardware work of detecting the AVX version and handling multi-threading. I think it costs money, which is why no one uses them.
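
For instance, MKL's vector maths functions reduce the OP's whole loop to one call. The sketch below is from memory of the vsAdd interface, so treat the exact header and signature as approximate; internally MKL dispatches to the best SSE/AVX code path for the host CPU.

#include <mkl.h>   // Intel MKL; vsAdd is one of its vector maths (VM) functions

// y[i] = a[i] + b[i] for i in [0, n), using whatever SIMD the CPU supports.
void compute(const float *a, const float *b, float *y, MKL_INT n) {
    vsAdd(n, a, b, y);
}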

Portability

Another problem is that different architectures do it differently: ARM has its own SIMD instruction set (NEON), with its own intrinsics.
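
For example, the OP's 4-float add written against NEON intrinsics looks something like this (a sketch; it only builds for ARM targets, and none of it carries over to x86):

#include <arm_neon.h>

// NEON version of the same 4-float add.
void compute(const float *a, const float *b, float *c) {
    float32x4_t va = vld1q_f32(a);     // load 4 floats
    float32x4_t vb = vld1q_f32(b);
    float32x4_t vc = vaddq_f32(va, vb);
    vst1q_f32(c, vc);                  // store 4 floats
}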

PowerPC had Altivec, and here actually Motorola (as it was) got things very right. Altivec was fully featured from the get-go, and app developers (on PowerPC Mac, all sorts of high power embedded systems) wrote for Altivec. PS3 was essentially all-Altivec.

Intel's mistake was to "hesitate"; first they did MMX - which was very underfeatured - then SSE, and so forth. Intel deliberately omitted a critical instruction - FMA - from x86/x64 for years. They had it in Itanium, and tried to use its presence there and its absence in x86 as a selling point for Itanium. Most used PowerPC instead, and Intel finally caved in around 2013, and Itanium died shortly thereafter. One of the reasons why Windows 11 has the hardware requirements it does is that MS are desperate to finally use some of these instructions in the operating system.

Anyway, if one wants highly portable C, one doesn't use such instructions directly.

Standardised Libraries

There are libraries beyond IPP/MKL; VSIPL is a standardised library of signal processing routines that is the same across all implementations. You can get it for Intel, Arm, PowerPC and other CPUs (for money), and if you write your C to use VSIPL then you can compile it for most popular CPUs and link it against a VSIPL implementation that knows how to get the very most out of the CPU. VSIPL was standardised by the Department of Defense - who have long had a big interest in portable, highly performant code. There are proprietary equivalents - Mercury Computer Systems (for disambiguation: mrcy.com) has SAL, which is a similar idea to VSIPL and pretty good.

The GNU Radio project has spun out a library called VOLK, which is an OSS attempt at a maths library that uses AVX, Arm's equivalent, etc. However, they've made up their own library API, instead of using a standard one like VSIPL.

Net Result

In general, getting good usage of these types of instruction pays big benefits. A big modern ThreadRipper or chunky Xeon running a lot of carefully optimised AVX code (where "optimised" means "getting the caches working well") is a prodigiously powerful number cruncher.

In real-world applications they're still a pretty big deal, because CPUs are pretty good at stream processing (data can be DMA'd into one part of memory by a sensor peripheral whilst the CPU continuously processes data already delivered). I have built and delivered such systems featuring large collections of Xeons with Preempt_RT added to Linux, and with careful software structuring had all cores on all the CPUs pegged at 95% utilisation, most of it SSE or AVX instructions buried in optimised libraries.

==EDIT==

Compilers, Operating Systems, CPUs

Commentators on this question seem keen to focus on compilers and autovectorisation. However, as Peter Cordes can attest from this StackOverflow question, there is a whole lot more to "performance" (the motivation for autovectorisation) than just writing code that the compiler can hopefully autovectorise.

The answer to that question, given by Maxim Egorushkin, was basically that the use of AVX instructions kicked the CPU down to a lower clock rate for thermal protection reasons, for too little work done using AVX.

OS context-switch time matters too; see this PDF from Intel on linuxfoundation.org. If a process uses more CPU registers (such as AVX's), its context is larger (obviously). A compiler achieving autovectorisation for a section of code forces the OS running it to save more context on preemption, displacing application data from L1 cache. With the context being 2,560 bytes, that's 7.8% of a 32k L1 cache, per process. That's not insignificant.

Really, on modern CPUs, vectorised code makes sense only if one is doing a decent amount of work in the SIMD unit. Turning on autovectorisation does not inevitably achieve that (especially in the absence of long loops), can be detrimental to performance in complicated ways, and if distributing a binary one then has compatibility constraints.

That's probably why people don't make as much use of autovectorisation as one might expect.

That's also why there are libraries such as Intel's IPP/MKL that do understand machine behaviour and, at runtime, choose the best approach to a particular function when it's run, guaranteeing an optimised result on every machine. One could perhaps emulate that with autovectorisation, if there were a -march-all option for each basic CPU architecture (and you ended up with a fat binary containing lots of different renderings of the same source).
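
Something along those lines does exist at the function level: GCC (and recent Clang) on glibc-based systems offer a target_clones attribute that emits several renderings of one function and resolves to the best at load time. A rough sketch, with a purely illustrative variant list:

// Rough sketch of function multi-versioning: the compiler emits a baseline
// ("default") clone plus AVX2 and AVX-512 clones of this function, and a
// resolver picks the most capable one the host CPU supports.
__attribute__((target_clones("default", "avx2", "avx512f")))
void add_arrays(const float *a, const float *b, float *restrict c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}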

The Intel library will also auto-thread functions - if it's worthwhile on the hosting CPU for the given input parameters. You can't match that with autovectorisation.

Other libraries (like VSIPL implementations) are coded for a particular opcode set and, sometimes, particular L1/2/3 cache sizes. Such specificity then becomes a constraint on binary portability, but the source code remains portable.

Upvotes: -3

Jérôme Richard

Reputation: 50846

Aliasing is the problem here. The compiler cannot know whether the memory regions pointed to by a, b and c overlap. Compilers will sometimes generate code that checks for overlap at run-time and chooses between a vectorized and a scalar loop, but here, with such tiny arrays, the check is not worth its overhead: it would make the function slower.1
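
Here is a minimal sketch of why this matters, assuming the OP's original compute (no restrict): the call below is legal C, c overlaps a, and a blind 4-wide load/add/store would compute a different result than the scalar statements.

#include <stdio.h>

void compute(const float *a, const float *b, float *c);  // the OP's version, no restrict

int main(void) {
    float buf[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    // c overlaps a: c[0] writes buf[1], which the next statement reads as a[1].
    // Scalar order gives buf[2] == 9; a blind 4-wide SIMD version would give 6.
    compute(buf, buf + 4, buf + 1);
    printf("%g\n", buf[2]);   // prints 9 with the scalar semantics the compiler must preserve
    return 0;
}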

The restrict keyword is meant to address this issue. Here is an example working on all mainstream compilers:

void compute(const float * a, const float * b, float * restrict c)
{
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}

Note restrict can also be applied to a and b, but this is not needed here, as pointed out by chtz in comments. For more information about that, please read this related post.


The provided code with global arrays does generate SIMD assembly code on Godbolt. Note the GCC code is not unrolled but this is another problem (which can be addressed with directives like #pragma GCC unroll(4)).


Footnote 1: Actually, when written as a loop instead of manually unrolled, GCC 14 is very eager to vectorize with an overlap check, even for only 4 floats (1 vector) and even with just -O2 -ftree-vectorize. Godbolt. Clang 19's threshold is 25 floats at -O2, but unfortunately at -O3 it's willing to fully unroll scalar code up to 49 floats, only check+vectorizing at 50 with the default -mtune=generic. So GCC's cost/benefit heuristics must weight loop vectorization much higher than combining loose statements into a vector operation, even for very small loop counts.
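
The loop form in question is essentially this (a paraphrase, not the exact Godbolt source):

// Loop form of the 4-element add, still without restrict. Per the footnote,
// GCC 14 vectorizes this with a run-time overlap check even at -O2 -ftree-vectorize.
void compute_loop(const float *a, const float *b, float *c) {
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];
}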

Realistically it's probably only worth it to check for overlap and vectorize with maybe 12 or 16 elements (assuming that non-overlap is the typical case and predicts well), if the code can't or didn't use restrict to promise non-overlap. Unless later code will do vector loads from the result, which could lead to store-forwarding stalls if we did scalar stores, giving more benefits to vectorizing even 4 or 8 elements.

Upvotes: 47
