Thomas
Thomas

Reputation: 6196

Understanding how the instrinsic functions for SSE use memory

Before I ask my question, just a little background information.

In C languages, when you assign to a variable, you can conceptually assume you just modified a little piece of memory in RAM.

int a = rand(); //conceptually, you created and assigned variable A in ram

In assembly language, to do the same thing, you essentially need the result of rand() stored in a register, and a pointer to "a". You would then do a store instruction to get the register content into ram.

When you program in C++ for example, when you assign and manipulate value type objects, you usually dont even have to think about their addresses or how or when they will be stored in registers.

Using SSE instrinics are strange because they appear to somewhere inbetween coding in C and assembly, in terms of the conceptual memory model.

You can call load/store functions, and they return objects. A math operation like _mm_add will return an object, yet it's unclear to me weather the result will actually be stored in the object unless you call _mm_store.

Consider the following example:

inline void block(float* y, const float* x) const {
// load 4 data elements at a time
__m128 X = _mm_loadu_ps(x);
__m128 Y = _mm_loadu_ps(y);
// do the computations
__m128 result = _mm_add_ps(Y, _mm_mul_ps(X, _mm_set1_ps(a)));
// store the results
_mm_storeu_ps(y, result);

}

There are alot of temporary objects here. Do the temporary objects actually not exist? Is it all just syntax sugar for calling assembly instrunctions in a C like way? What happens if instead of doing the store command at the end, you just kept the result, would the result then be more than syntax sugar, and will actually hold data?

TL:DR How am I suppose to think about memory when using SSE instrinsics?

Upvotes: 2

Views: 1584

Answers (2)

Peter Cordes
Peter Cordes

Reputation: 364180

Vector variables aren't special. They will be spilled to memory and re-loaded when needed later, if the compiler runs out of registers when optimizing a loop (or across a function call to a function the compiler can't "see" to know that it doesn't touch the vector regs).

gcc -O0 actually does tend to store to RAM when you set them, instead of keeping __m128i variables only in registers, IIRC.

You could write all your intrinsic-using code without ever using any load or store intrinsics, but then you'd be at the mercy of the compiler to decide how and when to move data around. (You actually still are, to some degree these days, thanks to compilers being good at optimizing intrinsics, and not just literally spitting out a load wherever you use a load intrinsic.)

Compilers will fold loads into memory operands for following instructions, if the value isn't needed as an input to something else as well. However, this is only safe if the data is at a known-aligned address, or an aligned-load intrinsic was used.

The way I currently think about load intrinsics is as a way of communicating alignment guarantees (or lack thereof) to the compiler. The "regular" SSE (non-AVX / non-VEX-encoded) versions of vector instructions fault if used with an unaligned 128b memory operand. (Even on CPUs supporting AVX, FWIW.) For example, note that even punpckl* lists its memory operand as a m128, and thus has alignment requirements, even if it only actually reads the low 64b. pmovzx lists its operand as a m128.

Anyway, using load instead of loadu tells the compiler that it can fold the load into being a memory operand for another instruction, even if it can't otherwise prove that it comes from an aligned address.

Compiling for an AVX target machine will allow the compiler to fold even unaligned loads into other operations, to take advantage of uop micro-fusion.

This came up in comments on How to specify alignment with _mm_mul_ps.

The store intrinsics apparently have two purposes:

  1. To tell the compiler whether it should use the aligned or unaligned asm instruction.
  2. To remove the need for a cast from __m128d to double * (doesn't apply to the integer case).

Just to confuse things, AVX2 introduced things like _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a), which stores the high/low halves to different addresses. It probably compiles to a vmovdqu / vextracti128 ..., 1 sequence. Incidentally, I guess they made vextracti128 with AVX512 in mind, since using it with 0 as the immediate is the same as vmovdqu, but slower and longer-to-encode.

Upvotes: 1

Paul R
Paul R

Reputation: 212969

An __m128 variable may be in a register and/or memory. It's much the same as with simple float or int variables - the compiler will decide which variables belong in registers and which must be stored in memory. In general the compiler will try to keep the "hottest" variables in registers and the rest in memory. It will also analyse the lifetimes of variables so that a register may be used for more than one variable within a block. As a programmer you don't need to worry about this too much, but you should be aware of how many registers you have, i.e.. 8 XMM registers in 32 bit mode and 16 in 64 bit mode. Keeping your variable usage below these numbers will help to keep everything in registers as far as possible. Having said that, the penalty for accessing an operand in L1 cache is not that much greater than accessing a register operand, so you shouldn't get too hung up on keeping everything in registers if it proves difficult to do so.

Footnote: this vagueness about whether SSE variables are in registers or memory when using intrinsics is actually quite helpful, and makes it much easier to write optimised code than doing it with raw assembler - the compiler does the grunt work of keeping track of register allocation and other optimisations, allowing you to concentrate on making the code work correctly.

Upvotes: 4

Related Questions