Reputation: 9900
My CPU supports all sorts of things
-march=CPU[,+EXTENSION...]
generate code for CPU and EXTENSION, CPU is one of:
generic32, generic64, i386, i486, i586, i686,
pentium, pentiumpro, pentiumii, pentiumiii, pentium4,
prescott, nocona, core, core2, corei7, l1om, k1om,
iamcu, k6, k6_2, athlon, opteron, k8, amdfam10,
bdver1, bdver2, bdver3, bdver4, znver1, btver1,
btver2
EXTENSION is combination of:
8087, 287, 387, 687, mmx, sse, sse2, sse3, ssse3,
sse4.1, sse4.2, sse4, avx, avx2, avx512f, avx512cd,
avx512er, avx512pf, avx512dq, avx512bw, avx512vl,
vmx, vmfunc, smx, xsave, xsaveopt, xsavec, xsaves,
aes, pclmul, fsgsbase, rdrnd, f16c, bmi2, fma, fma4,
xop, lwp, movbe, cx16, ept, lzcnt, hle, rtm, invpcid,
clflush, nop, syscall, rdtscp, 3dnow, 3dnowa,
padlock, svme, sse4a, abm, bmi, tbm, adx, rdseed,
prfchw, smap, mpx, sha, clflushopt, prefetchwt1, se1,
clwb, avx512ifma, avx512vbmi, avx512_4fmaps,
avx512_4vnniw, avx512_vpopcntdq, clzero, mwaitx,
ospke, rdpid, ptwrite, cet, no87, no287, no387,
no687, nommx, nosse, nosse2, nosse3, nossse3,
nosse4.1, nosse4.2, nosse4, noavx, noavx2, noavx512f,
noavx512cd, noavx512er, noavx512pf, noavx512dq,
noavx512bw, noavx512vl, noavx512ifma, noavx512vbmi,
noavx512_4fmaps, noavx512_4vnniw, noavx512_vpopcntdq
Yet something as simple as __m256h inter; yields an error: '__m256h' was not declared in this scope.
That makes sense, since the CPU requirement is the CPUID flags AVX512_FP16 + AVX512VL, and AVX512_FP16 is not on the list.
How does one get AVX512_FP16 support? Is it CPU version dependent, or can it be fixed with a patch?
Update: Intel mentions that AVX512_FP16 is only supported alongside AVX512BW [check]. I am compiling with -march=skylake-avx512, which compiles regular __m512 code but fails specifically on these FP16-based ops.
Upvotes: 0
Views: 3176
Reputation: 81
Because AVX512FP16 is an extension to the AVX512 ISA, it must either:
A) Have explicit hardware support built in.
B) Be emulated in software by promoting the type to another suitable alternative such as fp32 with specific rounding/conformance code.
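To make option B a little more concrete, here is a minimal sketch of the "promote to fp32" half of software emulation: it widens a raw IEEE 754 binary16 bit pattern to a float using integer operations only. The function name half_bits_to_float is hypothetical (it is not taken from any compiler or library), and a real emulation would also need the narrowing direction with correct rounding, which is omitted here.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen an IEEE 754 binary16 bit pattern to binary32.
   Every binary16 value is exactly representable as a float, so no rounding
   is needed in this direction. */
static float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16; /* move sign to bit 31 */
    uint32_t exp  = (h >> 10) & 0x1Fu;             /* 5-bit exponent */
    uint32_t man  = h & 0x3FFu;                    /* 10-bit mantissa */
    uint32_t bits;
    float f;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                           /* signed zero */
        } else {
            /* Subnormal half: shift until the implicit bit appears,
               then rebias the exponent (113 = 127 - 14). */
            uint32_t shift = 0;
            while ((man & 0x400u) == 0) {
                man <<= 1;
                shift++;
            }
            man &= 0x3FFu;
            bits = sign | ((113u - shift) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);   /* infinity or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13); /* normal: rebias 15 -> 127 */
    }

    memcpy(&f, &bits, sizeof f);                   /* type-pun without UB */
    return f;
}
```

For example, half_bits_to_float(0x3C00) gives 1.0f and half_bits_to_float(0x7BFF) gives 65504.0f, the largest finite half-precision value.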
At the time of your post, no systems on the market supported AVX512FP16.
As of this posting (Feb 10 2022), the only in-market support is the AVX512 P(erformance)-core workaround for Intel 12th generation K-series Alder Lake CPUs*.
These P-cores, based on the Golden Cove architecture, support AVX512FP16*.
To use the instruction in C or C++ a very recent compiler must be used. My own testing shows that GCC-12, Clang-14 and ICX 2022.0 are all capable of utilizing the instruction.
If you'd like to use an officially supported platform, the option is to wait for Intel Xeon Sapphire Rapids, which are based on only Golden Cove cores and will have the full AVX512 ISA enabled.
A snippet of code that compiles to FMA instructions from the AVX512FP16 ISA extension is at the end, with instructions on its usage.
*NB: This capability can only be enabled once the Gracemont E-cores are disabled, on specific vendors' motherboards with specific BIOS/microcode revisions. This is not sanctioned or supported by Intel.
The reason is mainly the different ISAs of the Gracemont and Golden Cove cores, and process pinning (but that is beyond the scope of this question).
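Since support depends on the exact CPU and firmware, it can help to ask the processor directly. The following is a minimal sketch, assuming GCC or Clang on x86 (it relies on __get_cpuid_count from <cpuid.h>); the function name cpu_reports_avx512fp16 is made up for this example. It only reads the CPUID feature bit (leaf 7, subleaf 0, EDX bit 23) and does not verify that the OS has enabled the AVX-512 register state, so treat a positive result as necessary rather than sufficient.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper: returns 1 if CPUID advertises AVX512-FP16, else 0.
   Assumption: no XGETBV check is done, so OS support is not confirmed. */
static int cpu_reports_avx512fp16(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Leaf 7, subleaf 0 holds the structured extended feature flags. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;               /* leaf 7 not available on this CPU */

    return (int)((edx >> 23) & 1u); /* EDX bit 23 = AVX512_FP16 */
#else
    return 0;                   /* not an x86 target */
#endif
}
```

Calling cpu_reports_avx512fp16() from main and printing the result is enough to see whether a given Alder Lake or Sapphire Rapids configuration exposes the extension.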
Use gcc-12 fp16_FMA_avx512.c -O3 -march=sapphirerapids -mavx512fp16 -o avx512example.bin
to generate an executable, if your platform supports the instruction.
Use gcc-12 fp16_FMA_avx512.c -O3 -march=sapphirerapids -mavx512fp16 -o avx512example.S -S
to generate an assembly file that shows the usage of the instructions themselves.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
/*
Simple example of FP16 arithmetic with its declaration
NB: This uses Clang/GCC convention FP16 declarations due to near universal platform support.
Any compiler that has yet to formally adopt ISO/IEC TS 18661-3:2015 (“Floating-point extensions for C”) will not support the type.
Known working x86_64 compilers as of Feb 08 2022 are:
Clang/LLVM-14+
GCC-12+
Intel ICX Version 2022.0.0
Known working architectures:
Intel Alder-Lake [ *under certain conditions]
Intel Sapphire Rapids
*/
int main(){
    float seed = 1;
    srand((time(0)));
    int count = 31;
    _Float16 factor = seed;
    //primaries
    _Float16 a=1.436;
    _Float16 b=0.83546;
    //arrays to be used for FMA
    _Float16 alpha[32];
    _Float16 delta[32];
    _Float16 omega[32];
    while (count>=0)
    {
        //fill the arrays with differing values
        alpha[count]=(_Float16) (a*factor);
        delta[count]=(_Float16) (b*factor);
        omega[count]=(_Float16) (factor+(a*b));
        factor = factor+b;
        count--;
    }
    printf("Print the FMA of 3 _Float16's that are cast as Float\n");
    count = 0; //the fill loop leaves count at -1, so restart at the first element
    while (count < 32){
        omega[count]=(omega[count]*alpha[count])+delta[count];
        count++;
    }
    printf("\n"); //clear last line
    count = 31; //the FMA loop leaves count at 32, so restart at the last element
    while (count>=0)
    {
        printf("%i %f \n", count, (float) omega[count]);
        count--;
    }
    // 32 entry variable can be used: 512bit/16bits per variable = 32 variables
    //c d e f g h i j k l m n o p q r s t u v w x y z aa ab ac ad ae af ag ah
    return 0;
}
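For the __m256h type from the question, the same FMA can also be written with explicit intrinsics. This is a sketch under assumptions rather than a tested recipe: it assumes a compiler that ships the AVX512-FP16 intrinsics (such as the versions listed above) and a build with -mavx512fp16 -mavx512vl, which is what defines the __AVX512FP16__ and __AVX512VL__ macros. The names fma16_ph and fp16_intrinsics_compiled_in are invented for this example. Without those flags the guarded part is skipped, which is also why __m256h is "not declared" under -march=skylake-avx512.

```c
#include <stddef.h>

#if defined(__AVX512FP16__) && defined(__AVX512VL__)
#include <immintrin.h>

/* Hypothetical helper: out[i] = a[i]*b[i] + c[i] for 16 half-precision lanes.
   A 256-bit __m256h register holds 256/16 = 16 _Float16 values. */
static void fma16_ph(const _Float16 *a, const _Float16 *b,
                     const _Float16 *c, _Float16 *out)
{
    __m256h va = _mm256_loadu_ph(a);   /* unaligned loads */
    __m256h vb = _mm256_loadu_ph(b);
    __m256h vc = _mm256_loadu_ph(c);
    _mm256_storeu_ph(out, _mm256_fmadd_ph(va, vb, vc));
}
#define HAVE_FP16_INTRINSICS 1
#else
/* Built without AVX512-FP16 + AVX512VL: __m256h is not available here. */
#define HAVE_FP16_INTRINSICS 0
#endif

/* Reports whether the intrinsic path was compiled in (1) or skipped (0). */
static int fp16_intrinsics_compiled_in(void)
{
    return HAVE_FP16_INTRINSICS;
}
```

Checking fp16_intrinsics_compiled_in() at startup is a cheap way to confirm that the build flags actually enabled the extension before blaming the hardware.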
Upvotes: 6