Simon F

Reputation: 1055

gcc (6.1.0) using 'wrong' instructions in SSE intrinsics

Background: I develop a computationally intensive tool, written in C/C++, that has to run on a variety of different x86_64 processors. To speed up the calculations, which involve both floating-point and integer work, the code contains rather a lot of SSE* intrinsics, with different paths tailored to different CPU SSE capabilities. (As the CPU flags are detected at the start of the program and used to set Booleans, I've assumed that branch prediction for the tailored blocks of code will work very effectively.)
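The detection code itself is not shown in the post. As a rough sketch of what such start-up detection could look like with GCC or Clang, the block below reads CPUID leaf 1 through the `<cpuid.h>` helper `__get_cpuid`. The struct and function names are hypothetical, and other compilers (e.g. MSVC) would need their own CPUID wrapper.

```cpp
#include <cpuid.h>  // GCC/Clang helper for the CPUID instruction

// Hypothetical feature record; the names are illustrative, not from the post.
struct CpuFeatures {
    bool sse2;
    bool sse3;
    bool ssse3;
    bool sse41;
    bool sse42;
};

// Query CPUID leaf 1 once at start-up and cache the SSE feature bits.
CpuFeatures detect_cpu_features() {
    CpuFeatures f = {};
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid returns 0 if the requested leaf is not supported.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.sse2  = ((edx >> 26) & 1u) != 0;  // EDX bit 26
        f.sse3  = ((ecx >> 0)  & 1u) != 0;  // ECX bit 0
        f.ssse3 = ((ecx >> 9)  & 1u) != 0;  // ECX bit 9
        f.sse41 = ((ecx >> 19) & 1u) != 0;  // ECX bit 19
        f.sse42 = ((ecx >> 20) & 1u) != 0;  // ECX bit 20
    }
    return f;
}
```

The resulting Booleans would then select the code path, e.g. `if (features.sse42) { ... } else { ... }`.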

For simplicity I've assumed only SSE2 through to SSE4.2 need to be considered.

In order to access SSE4.2 intrinsics for the 4.2 paths, I need to use gcc's -msse4.2 option.

The problem: at least with 6.1.0, gcc implements the SSE2 intrinsic _mm_cvtsi32_si128 with pinsrd (strictly an SSE4.1 instruction, which -msse4.2 also enables).

If I limit the compilation by using -msse2, it uses the SSE2 instruction movd, i.e. the one that the Intel "intrinsics guide" says it is supposed to use.

This is annoying on two counts.

1) The critical problem is that the program now crashes with an illegal instruction when it is run on a pre-SSE4.2 CPU. I don't have control over what hardware is used, so the executable needs to be compatible with older machines, yet needs to take advantage of features on newer hardware where available.

2) According to the Intel intrinsics guide, the pinsrd instruction is quite a lot slower than the movd it replaces. (pinsrd is more general, but that generality is not needed here.)

Does anyone know how to make gcc just use the instructions that the intrinsics guide says should be used yet still allow access to all SSE2 through SSE4* in the same compilation unit?

Update: I should also note that the same code is compiled under Linux, Windows and OSX using a variety of different compilers, so I would rather avoid compiler-specific extensions, or at least use as few as possible.

Update 2: (Thanks to @PeterCordes) It seems that if optimisation is enabled, gcc reverts from pinsrd to movd where appropriate.
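For anyone wanting to check this on their own toolchain, a reduced test case along the following lines should be enough. The function name is hypothetical and the file name in the comments is a placeholder; the commands only dump assembly so the three builds described above can be compared.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical reduced test case: the only intrinsic used is the SSE2 one
// from the question, which the Intel intrinsics guide documents as MOVD.
__m128i low_lane_from_int(int x) {
    return _mm_cvtsi32_si128(x);  // x goes to lane 0, upper lanes are zeroed
}

// Compare the instruction chosen in each build, for example:
//   g++ -msse2       -S repro.cpp
//   g++ -msse4.2     -S repro.cpp   (no optimisation)
//   g++ -msse4.2 -O2 -S repro.cpp
```

Whatever instruction is selected, the result is the same value; only the required CPU feature level differs.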

Upvotes: 4

Views: 1018

Answers (1)

Jason R

Reputation: 11696

If you pass the -msse4.2 flag on gcc's command line for a compilation step, the compiler assumes it is free to use anything up to and including the SSE4.2 instruction set for the entire translation unit. This can lead to the behavior that you described. If you need code that uses only SSE2 and earlier instructions, then you have to compile it with -msse2 (or with no flag at all if you're building for x86_64, where SSE2 is part of the baseline).

Some options that I can think of are:

  • If you can easily break down your code at the function level, then gcc's function multiversioning feature (a C++ extension) can help. It requires a relatively recent version of the compiler, but it allows you to do things like this (taken from the gcc function multiversioning documentation):

     #include <cassert>   // needed for the assert in main ()

     __attribute__ ((target ("default")))
     int foo ()
     {
       // The default version of foo.
       return 0;
     }
    
     __attribute__ ((target ("sse4.2")))
     int foo ()
     {
       // foo version for SSE4.2
       return 1;
     }
    
     __attribute__ ((target ("arch=atom")))
     int foo ()
     {
       // foo version for the Intel ATOM processor
       return 2;
     }
    
     __attribute__ ((target ("arch=amdfam10")))
     int foo ()
     {
       // foo version for the AMD Family 0x10 processors.
       return 3;
     }
    
     int main ()
     {
       int (*p)() = &foo;
       assert ((*p) () == foo ());
       return 0;
     }
    

    In this example, gcc will automatically compile the different versions of foo() and dispatch to the appropriate one at runtime based on the CPU's capabilities.

  • You can break the different implementations (SSE2, SSE4.2, etc.) into different translation units, then dispatch appropriately to the right implementation at runtime.

  • You can put all of the SIMD code into a shared library and build the shared library multiple times with different compiler flags. Then at runtime, you can detect the CPU's capabilities and load the appropriate version of the shared library. This is the approach taken by libraries like Intel's Math Kernel Library.
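As a hand-rolled variant of the first two options, the sketch below keeps both paths in one file by enabling SSE4.2 for a single function through a target attribute and choosing the path at runtime. It is only an illustration under stated assumptions: it relies on GCC 4.9+ or a recent Clang (it is not portable to MSVC), the CRC-32C workload and all function names are hypothetical, and `__builtin_cpu_supports` is a GCC/Clang builtin.

```cpp
#include <cstddef>
#include <cstdint>
#include <nmmintrin.h>  // SSE4.2 intrinsics (_mm_crc32_u8)

// Baseline path: bitwise CRC-32C (Castagnoli), reflected polynomial
// 0x82F63B78. Uses no instructions beyond the x86_64 baseline.
static std::uint32_t crc32c_generic(const unsigned char* p, std::size_t n) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < n; ++i) {
        crc ^= p[i];
        for (int k = 0; k < 8; ++k) {
            // Mask is all ones when the low bit is set, zero otherwise.
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// SSE4.2 path: the attribute enables SSE4.2 for this function only, so the
// file can be built without -msse4.2 and the rest stays at the baseline.
__attribute__((target("sse4.2")))
static std::uint32_t crc32c_sse42(const unsigned char* p, std::size_t n) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < n; ++i) {
        crc = _mm_crc32_u8(crc, p[i]);
    }
    return ~crc;
}

// Dispatcher: checks the CPU once, then reuses the cached answer.
std::uint32_t crc32c(const unsigned char* p, std::size_t n) {
    static const bool has_sse42 = __builtin_cpu_supports("sse4.2") != 0;
    return has_sse42 ? crc32c_sse42(p, n) : crc32c_generic(p, n);
}
```

Because the SSE4.2 function is only ever called behind the runtime check, an older CPU never executes its instructions. If compiler-specific attributes are unacceptable, the same split works with the two functions placed in separate source files that are compiled with different flags.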

Upvotes: 6
