Reputation: 11
Well, I have some trouble with Intel Compiler (ICC) optimization.
Generally, I want to use ICC's loop auto-vectorization. Earlier I used explicitly vectorized loops and functions, and as far as I know the Intel compiler lets you provide a scalar function plus a corresponding vectorized version via the __declspec(vector_variant()) directive. But I have some issues with this.
For example, right now I have both versions of the function:
#include <immintrin.h>   // needed for __m256i and _mm256_add_epi32

int plus(int a, int b)
{
    return a + b;
}

__m256i plus_avx(__m256i a, __m256i b)
{
    return _mm256_add_epi32(a, b);
}

int main()
{
    int aa[1000] = { 2 };   // note: only aa[0] is 2, the rest are zero-initialized
    int bb[1000] = { 4 };
    int cc[1000] = { 0 };
    for (int i = 0; i < 1000; ++i)
        cc[i] = plus(aa[i], bb[i]);
}
And I want ICC to use the vectorized version of the function in the auto-vectorized loop.
I tried to use __declspec(vector_variant()) like this:
__declspec(vector_variant(implements(plus(int a, int b)), vectorlength(8)))
__m256i plus_avx(__m256i a, __m256i b)
{
    return _mm256_add_epi32(a, b);
}
but I got this error:
1>error #15508: Incorrect return type of vector variant '?plus_avx@@YA?AT__m256i@@T1@0@Z' of function '?plus@@YAHJH@Z' at position 0.
1> The correct prototype is: '__m128i, __m128i ?plus_avx@@YA?AT__m256i@@T1@0@Z(__m128i v0_0, __m128i v0_1, __m128i v1_0, __m128i v1_1)'.
Why does the compiler require __m128i, and is there a way to use __m256i instead of __m128i?
Note: ICC is invoked with the /QaxCORE-AVX2 flag.
Upvotes: 0
Views: 71
Reputation: 11
I finally solved my issue; maybe it will be interesting for others.
The solution was to use the processor clause:
__declspec(vector_variant(implements(plus(int a, int b)), vectorlength(8), processor(core_4th_gen_avx)))
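For context, a minimal sketch of the full declaration with the processor clause added (syntax taken from the question above; core_4th_gen_avx tells ICC to assume an AVX2-capable target, so the 8-lane variant can take __m256i arguments):

#include <immintrin.h>

int plus(int a, int b);   // scalar version called in the source loop

__declspec(vector_variant(implements(plus(int a, int b)),
                          vectorlength(8),
                          processor(core_4th_gen_avx)))
__m256i plus_avx(__m256i a, __m256i b)
{
    return _mm256_add_epi32(a, b);   // one 256-bit add handles 8 ints per call
}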
Upvotes: 1
Reputation: 365727
AFAIK, there's no way to get auto-vectorization to use custom primitives. Just tell your compiler your arrays are aligned, and let it auto-vectorize from the pure ISO C++ using +, not _mm256_add_epi32.
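For example, a minimal sketch of that approach (alignas is standard C++11; the compiler then picks the vector width from the compiler options):

alignas(32) int aa[1000], bb[1000], cc[1000];   // 32-byte alignment suits 256-bit loads/stores

void add_arrays()
{
    for (int i = 0; i < 1000; ++i)
        cc[i] = aa[i] + bb[i];   // plain '+'; the compiler emits the SIMD adds itself
}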
Compiler options control the default vector width for auto-vectorization. (e.g. targeting Skylake-AVX512, ICC and some other compilers default to using 256b vectors, because actually using 512b vectors reduces the max turbo clock, and it's only worth it if the program will spend most of its time in vectorized loops. The compiler doesn't know this.)
For example, gcc has a -mprefer-avx128 option to auto-vectorize with 128-bit AVX instructions even when AVX2 256-bit integer instructions are available.
ICC18 introduced -qopt-zmm-usage:low|high, defaulting to low for Skylake-server and high for Xeon Phi, because Xeon Phi is designed around 512-bit vectors, but Skylake-AVX512 has to lower its max turbo to run 512-bit instructions. (There's no penalty for using AVX512 instructions on narrower vectors on Skylake, but Xeon Phi doesn't even allow that because it doesn't support the AVX512VL (vector-length) subset of AVX512.)
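Purely illustrative command lines (exact flag spellings differ between the Linux and Windows compiler drivers; these are the Linux forms of the flags mentioned above):

g++  -O3 -mavx2 -mprefer-avx128 add.cpp              # GCC: AVX2 enabled, but prefer 128-bit vectors
icpc -O3 -xCORE-AVX512 -qopt-zmm-usage=high add.cpp  # ICC18+: opt in to full 512-bit ZMM usage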
I'm not sure with ICC how to control 128-bit vs. 256-bit auto-vectorization, but it's definitely not by writing functions using intrinsics.
Upvotes: 0