Reputation: 701
I am running jobs on a cluster composed of machines with different architectures:
gcc -march=native -Q --help=target | grep -- '-march=' | cut -f3
gives me one of these: broadwell, haswell, ivybridge, sandybridge or skylake.
The executable needs to be the same, so I cannot use -march=native, but at the same time the architectures have things in common (I think they all support AVX?).
I am aware that gcc (contrary to Intel icc) does not allow for multiple architectures in a single executable. What I would like to know is if there is a way to ask gcc for the highest set of instructions compatible with all the architectures listed above.
gcc version: 8.1.1
Upvotes: 7
Views: 2214
Reputation: 701
Comments suggested that I work out the 'intersection' of the architectures myself. The following bash script seems to do the job.
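#!/usr/bin/env bash
archs=("broadwell" "haswell" "ivybridge" "sandybridge" "skylake")
# Dump the -m options of each architecture, sorted on the option name for join
for ar in "${archs[@]}"; do
    gcc -march="$ar" -Q --help=target | grep -- " -m" | sort -k 1b,1 > "$ar.log"
done
# Join on the option name: each line collects one status per architecture
cp "${archs[0]}.log" all.log
for ar in "${archs[@]:1}"; do
    join all.log "$ar.log" > tmp.log
    mv tmp.log all.log
done
# Keep the options that are enabled on every architecture
grep "\[activé]" all.log | grep -v "\[désactivé]" | cut -d' ' -f1 | tr '\n' ' '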
(Computer in French: "activé" => "enabled", "désactivé" => "disabled")
The output is
-m128bit-long-double -m64 -m80387 -maes -malign-stringops -mavx -mcx16 -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mhard-float -mieee-fp -mlong-double-80 -mmmx -mpclmul -mpopcnt -mpush-args -mred-zone -msahf -msse -msse2 -msse3 -msse4 -msse4.1 -msse4.2 -mssse3 -mstv -mtls-direct-seg-refs -mvzeroupper -mxsave -mxsaveopt
As I expected, all the architectures support both SSE and AVX.
Upvotes: 4
Reputation: 365247
Intel hasn't ever removed instruction sets in future versions of the same CPU. i.e. a binary that works on an old Intel CPU always works on a newer Intel CPU.
(The one exception to this is first-gen Xeon Phi: Knight's Corner used an incompatible variant of AVX512 called KNI, but later Xeon Phi accelerator cards / computers use AVX512.)
If you must use the same binary on all CPUs, use gcc -march=sandybridge -mtune=haswell, and make sure your important arrays are aligned by 32 bytes.
Maybe worth benchmarking with gcc -march=sandybridge (i.e. with tune=sandybridge) as well, to see which works better for your code. -mprefer-avx128 or -mprefer-vector-width=256 might be interesting to try: some loops get messy when gcc auto-vectorizes with 256-bit vectors.
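To compare these options, one approach is to build one binary per flag set and time each of them on the different node types. A minimal sketch, not a definitive build setup: the single source file, the -O3 level, the my_prog-* output names and the function name are all assumptions, and the flag sets are just the ones mentioned above.

```shell
#!/bin/sh
# Build one binary per candidate flag set so they can be benchmarked
# side by side. Source file, -O3 and output names are assumptions.
build_variants() {
    src=$1
    while read -r name flags; do
        # $flags is left unquoted on purpose so it splits into separate options
        ${CC:-gcc} -O3 $flags -o "my_prog-$name" "$src" </dev/null || return 1
    done <<'EOF'
snb-tune-hsw -march=sandybridge -mtune=haswell
snb-tune-snb -march=sandybridge
snb-avx128 -march=sandybridge -mtune=haswell -mprefer-avx128
EOF
}

# Usage: build_variants my_prog.c
```

Every variant keeps -march=sandybridge, so each resulting binary should still run on all the listed node types; only the tuning differs.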
SnB/IvB have inefficient misaligned AVX loads/stores, so tune=sandybridge sets -mavx256-split-unaligned-load, which sucks a lot if your data is aligned at runtime but the compiler didn't know that. The extra instructions and shuffles aren't helpful on Haswell, so -mtune=haswell includes -mno-avx256-split-unaligned-load.
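To see what a given -mtune does to these flags on your gcc version, the same -Q --help=target query used in the question should work. A small sketch: the function name is made up, and LC_ALL=C is only there to get untranslated [enabled]/[disabled] markers.

```shell
#!/bin/sh
# Show how a given -mtune choice sets the AVX split-unaligned tuning flags
# while the ISA baseline stays fixed at -march=sandybridge.
# LC_ALL=C forces untranslated [enabled]/[disabled] markers.
show_split_flags() {
    LC_ALL=C gcc -march=sandybridge -mtune="$1" -Q --help=target |
        grep -- '-mavx256-split-unaligned'
}

# Usage:
#   show_split_flags sandybridge
#   show_split_flags haswell
```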
Unfortunately gcc doesn't have a "tune=avx2" option to tune for all CPUs which have AVX2, or an option to tune for the average CPU which supports the instruction sets you enabled. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568. Your only choices are tune for a specific CPU, or tune for the generic baseline, or use specific tuning options.
ifunc: you have to activate it in the source for specific functions. See https://lwn.net/Articles/691932/ for more about function multi-versioning.
$PATH setting
On each cluster node, create a /etc/node-type or whatever, which has sandybridge or haswell or whatever. Any per-node filesystem is fine, or re-detect it at run time with gcc or something cheaper. In your job script:
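#!/bin/sh
# Pick the binary directory matching this node's architecture.
# $(cat ...) is used because $(<file) is a bash extension, not POSIX sh.
bin_dir="./bin-$(cat /etc/node-type)"
exec "$bin_dir/my_prog" "$@"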
Create symlinks as necessary to make bin-skylake and bin-broadwell use the Haswell binaries.
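A sketch of that symlink setup, assuming the bin-* directories sit next to each other and bin-haswell already exists. The function name is hypothetical, and ln -sfn relies on the -n option found in GNU and BSD ln.

```shell
#!/bin/sh
# Make bin-skylake and bin-broadwell point at the Haswell binaries.
# $1 is the directory that contains bin-haswell (defaults to the current one).
link_haswell_aliases() {
    root=${1:-.}
    for name in skylake broadwell; do
        # Relative target, so the links keep working if the tree is moved
        ln -sfn bin-haswell "$root/bin-$name" || return 1
    done
}

# Usage: link_haswell_aliases /path/to/job/dir
```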
Haswell introduced AVX2 and FMA, and BMI1/2. If you're number-crunching, you really want FMA. BDW/SKL didn't introduce any significant ISA extensions that compilers can use to make your code run faster. Tuning for BDW/SKL is not significantly different from tuning for Haswell either.
If you have any Skylake-avx512 CPUs, that's different.
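Since AVX2 and FMA are what separate Haswell and later from SnB/IvB here, the node type could also be re-detected at run time from the CPU flags instead of being read from a per-node file. A rough sketch, assuming Linux's /proc/cpuinfo; the function names and the generic fallback are made up for illustration.

```shell
#!/bin/sh
# has_flag FLAG "FLAGS": succeed if FLAG appears as a whole word in FLAGS.
has_flag() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Map a CPU flags string to the name of the binary directory to use.
pick_bin_arch() {
    if has_flag avx2 "$1" && has_flag fma "$1"; then
        echo haswell        # also covers Broadwell and Skylake
    elif has_flag avx "$1"; then
        echo sandybridge    # also covers Ivy Bridge
    else
        echo generic        # hypothetical fallback for CPUs without AVX
    fi
}

# Read the flags of the first CPU listed in /proc/cpuinfo (Linux only).
node_flags() {
    awk -F: '/^flags/ { print $2; exit }' /proc/cpuinfo
}

# Usage in a job script:
#   bin_dir="./bin-$(pick_bin_arch "$(node_flags)")"
```

With this mapping Broadwell and Skylake nodes resolve to haswell directly, so no extra symlinks are needed for them.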
Upvotes: 4
Reputation: 12332
What I would like to know is if there is a way to ask gcc for the highest set of instructions compatible with all the architectures listed above.
That's a NO.
If you want optimal performance, look into fat binaries, as Sander De Dycker commented.
An alternative solution though is to compile binaries and libraries for each instruction set and set PATH and LD_LIBRARY_PATH on each system to pick the best instruction set.
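For example, each node's login or job environment could prepend the matching directories. A sketch, assuming a hypothetical layout of <prefix>/<arch>/bin and <prefix>/<arch>/lib; the /opt/myapp default and the function name are placeholders.

```shell
#!/bin/sh
# Prepend the per-architecture bin and lib directories to the search paths.
# $1 = architecture name (e.g. haswell), $2 = install prefix (placeholder default).
use_arch_dirs() {
    arch=$1
    prefix=${2:-/opt/myapp}
    PATH="$prefix/$arch/bin:$PATH"
    # Only add a ":" separator if LD_LIBRARY_PATH was already non-empty
    LD_LIBRARY_PATH="$prefix/$arch/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

# Usage: use_arch_dirs haswell
```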
Upvotes: 1