Reputation: 9
const __m128i ___n = _mm_set_epi32( 0x80808080, 0x80808080, 0x80808080, 0x80808080 );
const __m128i w___ = _mm_set_epi32( 0x80808080, 0x80808080, 0x80808080, 0x0f0e0d0c );
const __m128i z___ = _mm_set_epi32( 0x80808080, 0x80808080, 0x80808080, 0x0b0a0908 );
const __m128i zw__ = _mm_set_epi32( 0x80808080, 0x80808080, 0x0f0e0d0c, 0x0b0a0908 );
const __m128i y___ = _mm_set_epi32( 0x80808080, 0x80808080, 0x80808080, 0x07060504 );
const __m128i yw__ = _mm_set_epi32( 0x80808080, 0x80808080, 0x0f0e0d0c, 0x07060504 );
const __m128i yz__ = _mm_set_epi32( 0x80808080, 0x80808080, 0x0b0a0908, 0x07060504 );
const __m128i yzw_ = _mm_set_epi32( 0x80808080, 0x0f0e0d0c, 0x0b0a0908, 0x07060504 );
const __m128i x___ = _mm_set_epi32( 0x80808080, 0x80808080, 0x80808080, 0x03020100 );
const __m128i xw__ = _mm_set_epi32( 0x80808080, 0x80808080, 0x0f0e0d0c, 0x03020100 );
const __m128i xz__ = _mm_set_epi32( 0x80808080, 0x80808080, 0x0b0a0908, 0x03020100 );
const __m128i xzw_ = _mm_set_epi32( 0x80808080, 0x0f0e0d0c, 0x0b0a0908, 0x03020100 );
const __m128i xy__ = _mm_set_epi32( 0x80808080, 0x80808080, 0x07060504, 0x03020100 );
const __m128i xyw_ = _mm_set_epi32( 0x80808080, 0x0f0e0d0c, 0x07060504, 0x03020100 );
const __m128i xyz_ = _mm_set_epi32( 0x80808080, 0x0b0a0908, 0x07060504, 0x03020100 );
const __m128i xyzw = _mm_set_epi32( 0x0f0e0d0c, 0x0b0a0908, 0x07060504, 0x03020100 );
const __m128i LUT[16] = { ___n, x___, y___, xy__, z___, xz__, yz__, xyz_, w___, xw__, yw__, xyw_, zw__, xzw_, yzw_, xyzw };
I use a lookup table like the one above for the SSE/SSSE3 version of a compare and left-pack routine. The compares run many times (about 60) per second, so I'd like the table to live in memory instead of being set up on every call, with its scope limited to one .c file. But defining it outside a function, with or without static, yields the error "initializer element is not constant" for each 'set'. Why does this happen, and how can I do this properly?
Upvotes: 1
Views: 286
Reputation: 365267
_mm_set_epi32 with compile-time-constant args gets optimized to a vector constant, typically loaded from memory. You don't need to help the compiler with this, and in fact it's worse if you try because compilers are strangely bad at it. If you do need/want to help the compiler with constant layout, use an array of some type like alignas(16) static const int32_t LUT[] = {...}; and use _mm_load_si128( (__m128i*)&LUT[i*4] ) or something like that.
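A minimal sketch of that array-plus-load approach, assuming C11 (for alignas) and reusing the table values from the question. The names shuffle_lut and load_shuffle_mask are made up for illustration, and uint32_t is used instead of int32_t only so that 0x80808080 fits without a signed conversion.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: __m128i, _mm_load_si128 */

/* 16 masks x 4 dwords each, lowest dword first (memory order).
   A plain integer array has constant initializers, so C accepts it at file scope. */
static alignas(16) const uint32_t shuffle_lut[16 * 4] = {
    0x80808080u, 0x80808080u, 0x80808080u, 0x80808080u, /*  0: ____ */
    0x03020100u, 0x80808080u, 0x80808080u, 0x80808080u, /*  1: x___ */
    0x07060504u, 0x80808080u, 0x80808080u, 0x80808080u, /*  2: y___ */
    0x03020100u, 0x07060504u, 0x80808080u, 0x80808080u, /*  3: xy__ */
    0x0b0a0908u, 0x80808080u, 0x80808080u, 0x80808080u, /*  4: z___ */
    0x03020100u, 0x0b0a0908u, 0x80808080u, 0x80808080u, /*  5: xz__ */
    0x07060504u, 0x0b0a0908u, 0x80808080u, 0x80808080u, /*  6: yz__ */
    0x03020100u, 0x07060504u, 0x0b0a0908u, 0x80808080u, /*  7: xyz_ */
    0x0f0e0d0cu, 0x80808080u, 0x80808080u, 0x80808080u, /*  8: w___ */
    0x03020100u, 0x0f0e0d0cu, 0x80808080u, 0x80808080u, /*  9: xw__ */
    0x07060504u, 0x0f0e0d0cu, 0x80808080u, 0x80808080u, /* 10: yw__ */
    0x03020100u, 0x07060504u, 0x0f0e0d0cu, 0x80808080u, /* 11: xyw_ */
    0x0b0a0908u, 0x0f0e0d0cu, 0x80808080u, 0x80808080u, /* 12: zw__ */
    0x03020100u, 0x0b0a0908u, 0x0f0e0d0cu, 0x80808080u, /* 13: xzw_ */
    0x07060504u, 0x0b0a0908u, 0x0f0e0d0cu, 0x80808080u, /* 14: yzw_ */
    0x03020100u, 0x07060504u, 0x0b0a0908u, 0x0f0e0d0cu, /* 15: xyzw */
};

/* Hypothetical helper: fetch the mask for a 4-bit "lanes to keep" index. */
static inline __m128i load_shuffle_mask(unsigned keep_bits)
{
    assert(keep_bits < 16u);
    /* Aligned load is safe because the array is alignas(16) and each row is 16 bytes. */
    return _mm_load_si128((const __m128i *)&shuffle_lut[keep_bits * 4u]);
}
```

The static keeps the table private to the one .c file, which is the scoping the question asked for.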
Simple vectors like all-zero or all-one bits can even get materialized in a register with pxor xmm0,xmm0 or pcmpeqd xmm1, xmm1 even more efficiently than a load, so making sure the compiler can see the constant values when optimizing a function is a good reason to define your constants with _mm_set* inside functions.
Think of _mm_set_epi32 as being like a string literal: the compiler figures out where to put it, and will even do duplicate merging if multiple functions need the same vector constant. (Unlike a string literal, it's a value, not something that decays to a pointer to the storage, so only parts of that analogy work.)
The only good place for _mm_set* is inside a function; current compilers suck at handling global / static variables of __m128i type efficiently. They fail to treat the value as a static initializer, so they actually put an anonymous vector constant in section .rodata, and have a run-time constructor / initializer function copy from that to space in the .bss for the named variable in static storage.
(In C++, this can also happen for a static __m128i inside a function, which makes it need a guard variable. C++ allows non-constant global initializers, like int foo = bar(123); at global scope even if bar isn't constexpr. In C, initializers for objects with static storage duration must be constant expressions, so you get the error you're running into.)
For example:
#include <immintrin.h>
__m128i foo() {
return _mm_setr_epi32(1,2,3,4);
}
This compiles with GCC 11.2 -O3 (on Godbolt) to the following asm. (clang and ICC, and I think MSVC, are all similar for all of the following code blocks.)
# in .text
foo():
movdqa xmm0, XMMWORD PTR .LC0[rip]
ret
# in .rodata
.LC0:
.quad 8589934593 # 0x200000001
.quad 17179869187 # 0x400000003
__m128i bar(__m128i v) {
return _mm_add_epi32(v, _mm_setr_epi32(1,2,3,4));
}
bar(long long __vector(2)):
paddd xmm0, XMMWORD PTR .LC1[rip]
ret
.set .LC1,.LC0 # make LC1 a synonym for LC0
# GCC noticed and merged at compile time, not leaving it for the linker.
__m128i retzero() {
//return _mm_setzero_si128();
return _mm_setr_epi32(0,0,0,0); // optimizes the same
}
pxor xmm0, xmm0
ret
But here's what you get from a C++ compiler for a global vector:
__m128i globvec = _mm_setr_epi32(1,2,3,4);
_GLOBAL__sub_I_foo(): # static initializer code
movdqa xmm0, XMMWORD PTR .LC0[rip] # copy from anonymous .rodata
movaps XMMWORD PTR globvec[rip], xmm0 # to named .bss space
ret
# in .bss
globvec:
.zero 16
This is what the C error message is "saving" you from; C doesn't allow non-constant initializers for objects with static storage duration, only for automatic (non-static local) variables inside functions.
For a static __m128i inside a function in C++, it would be even worse: you'd get a guard variable to make sure the non-constant initializer ran only in the first call.
Code actually using globvec is basically fine: it will use it as a memory source operand or load it normally. But optimizations that depend on some elements having a known value, such as constant-propagation through some operations, won't be possible. You're also using twice as much space, although the initializer data is only touched once during startup, so there isn't an impact on cache footprint.
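If a named file-scope vector constant is still wanted in C, one possible workaround (not something this answer proposes, just a sketch) is a union whose first member is a plain integer array. Brace initialization then only involves integer constants, which C accepts as a static initializer. The names vec128_const, one_to_four and add_one_to_four are hypothetical, and C11 is assumed for static_assert.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: __m128i, _mm_add_epi32 */

typedef union {
    uint32_t u32[4];   /* first member: this is what the braces initialize */
    __m128i  v;        /* same 16 bytes, viewed as a vector */
} vec128_const;

static_assert(sizeof(vec128_const) == sizeof(__m128i),
              "union must be exactly one vector wide");

/* Constant initializer, so no run-time constructor is needed. */
static const vec128_const one_to_four = { { 1u, 2u, 3u, 4u } };

/* Same operation as bar() above, but reading the named constant. */
static inline __m128i add_one_to_four(__m128i x)
{
    return _mm_add_epi32(x, one_to_four.v);
}
```

Whether the compiler optimizes this as well as an _mm_setr_epi32 inside the function is worth checking in the generated asm rather than assuming.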
Upvotes: 3
Reputation: 223633
The compiler reports that an initializer element is not constant because _mm_set_epi32 is a function call and does not satisfy the “constant” requirements for initializers. Also, the various variables ___n and such that you define do not qualify as constants for initializing LUT.
You can define your array with:
const __v4su LUT[16] =
{
{ 0x80808080, 0x80808080, 0x80808080, 0x80808080 },
{ 0x03020100, 0x80808080, 0x80808080, 0x80808080 },
{ 0x07060504, 0x80808080, 0x80808080, 0x80808080 },
{ 0x03020100, 0x07060504, 0x80808080, 0x80808080 },
{ 0x0b0a0908, 0x80808080, 0x80808080, 0x80808080 },
{ 0x03020100, 0x0b0a0908, 0x80808080, 0x80808080 },
{ 0x07060504, 0x0b0a0908, 0x80808080, 0x80808080 },
{ 0x03020100, 0x07060504, 0x0b0a0908, 0x80808080 },
{ 0x0f0e0d0c, 0x80808080, 0x80808080, 0x80808080 },
{ 0x03020100, 0x0f0e0d0c, 0x80808080, 0x80808080 },
{ 0x07060504, 0x0f0e0d0c, 0x80808080, 0x80808080 },
{ 0x03020100, 0x07060504, 0x0f0e0d0c, 0x80808080 },
{ 0x0b0a0908, 0x0f0e0d0c, 0x80808080, 0x80808080 },
{ 0x03020100, 0x0b0a0908, 0x0f0e0d0c, 0x80808080 },
{ 0x07060504, 0x0b0a0908, 0x0f0e0d0c, 0x80808080 },
{ 0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c },
};
Since the type is changed from __m128i to __v4su, you may need casts to work with it. (The vector operations are designed for this.)
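As a rough sketch of what such a cast could look like in a left-pack step, assuming GCC or Clang (where __v4su comes from the intrinsics headers and same-size vector casts are allowed) and an SSSE3-capable CPU. The names pack_lut and left_pack_epi32 are invented for this example; the target attribute is there only so the snippet builds without -mssse3.

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (also pulls in the SSE2 header) */

/* Same 16 rows as the table above, element 0 first. */
static const __v4su pack_lut[16] = {
    { 0x80808080, 0x80808080, 0x80808080, 0x80808080 },
    { 0x03020100, 0x80808080, 0x80808080, 0x80808080 },
    { 0x07060504, 0x80808080, 0x80808080, 0x80808080 },
    { 0x03020100, 0x07060504, 0x80808080, 0x80808080 },
    { 0x0b0a0908, 0x80808080, 0x80808080, 0x80808080 },
    { 0x03020100, 0x0b0a0908, 0x80808080, 0x80808080 },
    { 0x07060504, 0x0b0a0908, 0x80808080, 0x80808080 },
    { 0x03020100, 0x07060504, 0x0b0a0908, 0x80808080 },
    { 0x0f0e0d0c, 0x80808080, 0x80808080, 0x80808080 },
    { 0x03020100, 0x0f0e0d0c, 0x80808080, 0x80808080 },
    { 0x07060504, 0x0f0e0d0c, 0x80808080, 0x80808080 },
    { 0x03020100, 0x07060504, 0x0f0e0d0c, 0x80808080 },
    { 0x0b0a0908, 0x0f0e0d0c, 0x80808080, 0x80808080 },
    { 0x03020100, 0x0b0a0908, 0x0f0e0d0c, 0x80808080 },
    { 0x07060504, 0x0b0a0908, 0x0f0e0d0c, 0x80808080 },
    { 0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c },
};

/* Move the 32-bit lanes selected by keep_bits (bit 0 = lane 0) toward lane 0.
   Mask bytes with the high bit set (0x80) make pshufb write zero. */
__attribute__((target("ssse3")))
static __m128i left_pack_epi32(__m128i v, unsigned keep_bits)
{
    assert(keep_bits < 16u);
    /* The cast reinterprets the 16 bytes; it does not convert element values. */
    return _mm_shuffle_epi8(v, (__m128i)pack_lut[keep_bits]);
}
```

In a compare-and-pack routine the index would presumably come from something like _mm_movemask_ps applied to the compare result, which yields one bit per 32-bit lane.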
However, defining it outside of any function and/or with static storage duration does not ensure it will be in memory.
Upvotes: 4