Reputation: 8283
To my delight, I found that clang will let you write explicit vector code, without resorting to intrinsics, using extended vectors.
For instance, this code:
typedef float floatx16 __attribute__((ext_vector_type(16)));
floatx16 add( floatx16 a, floatx16 b )
{
    return a + b;
}
...compiles to a single instruction when built with clang -march=skylake-avx512:
vaddps zmm0, zmm0, zmm1
To write branch-free code, I want to blend AVX-512 vectors.
With intrinsics, you would use _mm512_mask_blend_ps. (By the way, why does AVX-512 use mask,a,b order while AVX uses a,b,mask order?)
Trying to do the blend with the ternary operator does not work:
typedef float floatx16 __attribute__((ext_vector_type(16)));
floatx16 minimum( floatx16 a, floatx16 b )
{
    return a < b ? a : b;
}
...results in...
error: used type 'int __attribute__((ext_vector_type(16)))' (vector of 16 'int' values) where arithmetic or pointer type is required
Is it possible to do a vector blend (vblendmps zmm {k}, zmm, zmm) using ext_vector_type(16) variables in C?
Upvotes: 3
Views: 867
Reputation: 8283
(This is the comment by @chtz in answer-form:)
There are at least two different ways to do vector types:
Form A:
__attribute__ ( ( ext_vector_type(numelements) ) );
Form B:
__attribute__( ( vector_size(numbytes) ) );
When using form A, the expression c ? x : y causes a compile error with clang 11.
Worse than that, gcc 10 will just silently pretend that ext_vector_type(N) has 4 elements even if N is 8 or 16.
When using form B, the expression c ? x : y is properly translated into a vector blend by clang 11. Clang 10 and gcc 10 both compile it as well, but they emit a different instruction sequence instead of the blend.
It is unclear to me why the ext_vector_type form exists, especially considering how badly it works.
Ugh... this only works in C++ but not in C. WHY???
The difference in behaviour is by specification: the GCC vector-extension documentation describes the vector form of ?: as a C++ feature.
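For plain C, one workaround that should compile with both gcc and clang is to build the blend by hand from the comparison mask. This is only a sketch: it assumes form B (vector_size) types, and the names intx16 and blend are made up for this example, not part of any compiler API. A vector comparison yields an integer vector whose lanes are -1 (all bits set) where the comparison is true and 0 where it is false, so the selection can be done with bitwise operations:

```c
/* Sketch only: intx16 and blend() are hypothetical names for this example. */
typedef float floatx16 __attribute__((vector_size(64)));  /* 16 floats  */
typedef int   intx16   __attribute__((vector_size(64)));  /* 16 ints, same size */

/* Pick x[i] where mask[i] is all-ones, y[i] where mask[i] is zero. */
static inline floatx16 blend(intx16 mask, floatx16 x, floatx16 y)
{
    /* Casts between same-sized vector types reinterpret the bits. */
    return (floatx16)((mask & (intx16)x) | (~mask & (intx16)y));
}

floatx16 minimum(floatx16 a, floatx16 b)
{
    intx16 mask = a < b;   /* each lane is -1 (true) or 0 (false) */
    return blend(mask, a, b);
}
```

Whether this ends up as a single vblendmps (or even a vminps) depends on the compiler version and flags, so check the generated assembly for your setup. Without AVX-512 enabled the code still compiles, but the compiler may warn about the ABI for passing 64-byte vectors by value.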
Upvotes: 3