Reputation: 2165
I have compiled the same Fortran libraries and code in both aarch64 and x86_64. It is a model that runs algorithms across n-dimensional arrays / matrices. The ARM CPU is the Amazon Graviton2. AMD & Intel options in AWS produce identical results when the code is compiled and run for x86_64.
I'm using gcc / g++ / gfortran / mpich (all version 8.3.0, from Debian Buster's main repos) with the following flags:
-O2 -ftree-vectorize -funroll-loops -w -ffree-form -ffree-line-length-none -fconvert=big-endian -frecord-marker=4
It all compiles and runs fine; however, I notice that the results in the model output differ very slightly. It seems to be a matter of precision or rounding, as most values are the same between outputs. However, there are (seemingly) random values throughout the output where it looks like the code compiled for one architecture rounded down or truncated while the other rounded up.
The output is stored as NetCDF (using NetCDF-Fortran version 4.5.3) and the md5sum of the files is the same across x86_64 CPUs but differs on aarch64.
Any ideas of why this might be happening? Or any flags I can use during compilation to ensure that I get identical results across architectures?
The values I'm looking at have a precision of 5 decimal places, e.g. 123.12345.
Here is a snippet from a diff of the output where you can see that most values are identical, but a few seem to have been rounded differently (I've marked the differing values with **):
657c657
< 18.83633, 18.83212, 18.82778, **18.82337**, 18.81886, 18.81425, 18.80956,
---
> 18.83633, 18.83212, 18.82778, **18.82336**, 18.81886, 18.81425, 18.80956,
1151c1151
< 17.35448, 17.37331, 17.39206, 17.41071, 17.42931, **17.4478**, 17.46622,
---
> 17.35448, 17.37331, 17.39206, 17.41071, 17.42931, **17.44779**, 17.46622,
1711c1711
< 19.77562, 19.77532, 19.77493, 19.77445, 19.77386, 19.77319, **19.77241**,
---
> 19.77562, 19.77532, 19.77493, 19.77445, 19.77386, 19.77319, **19.77242**,
2130c2130
< 20.06532, 20.06839, **20.07135**, 20.07423, 20.07702, 20.0797, 20.0823,
---
> 20.06532, 20.06839, **20.07136**, 20.07423, 20.07702, 20.0797, 20.0823,
2140c2140
< 20.04788, 20.04424, 20.04047, **20.03661**, 20.03268, 20.02863, 20.02448,
---
> 20.04788, 20.04424, 20.04047, **20.03662**, 20.03268, 20.02863, 20.02448,
2600c2600
< 11.54104, 11.57732, 11.61352, 11.6497, 11.68579, **11.72186**, 11.75784,
---
> 11.54104, 11.57732, 11.61352, 11.6497, 11.68579, **11.72185**, 11.75784,
Upvotes: 4
Views: 3929
Reputation: 6105
If the code only uses basic arithmetic operations such as +, -, *, / and sqrt, and the compiler is in IEEE 754 conformance mode, the output should be bit-identical regardless of the CPU used. This conformance mode is usually the default setting.
Otherwise the issue is probably caused by a compiler or CPU bug.
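One way to verify this on both machines is to compare raw bit patterns instead of rounded decimal output. A minimal sketch (a standalone test program, not part of your model) using Fortran's transfer intrinsic:

program bit_compare
  implicit none
  real(kind=8) :: x
  ! Basic operations and sqrt are correctly rounded under IEEE 754,
  ! so this bit pattern should match across architectures.
  x = sqrt(2.0d0) / 3.0d0
  write(*,'(A,Z16.16)')  'bits  = ', transfer(x, 1_8)
  write(*,'(A,ES24.17)') 'value = ', x
end program bit_compare

If the hex output already differs for something this simple, the compiler is not in conformance mode (or there is a genuine bug); if it matches, the divergence comes from somewhere else, such as the math library.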
Options such as -ffast-math put the compiler in non-IEEE 754 conformance mode. The compiler then uses mathematical equivalence rules to optimize the code, and these rewrites are not necessarily numerically equivalent (e.g., ((a*a)*a)*a -> (a*a)*(a*a) and the like). If this is the case, and the compiler optimizes the ARM code differently from the x86_64 code, that may be the explanation.
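To see why such re-association matters, here is a minimal sketch. It uses a sum rather than the product above, because the effect is then guaranteed to show: the two groupings perform the same mathematical operations yet round to different doubles.

program reassoc
  implicit none
  real(kind=8) :: eps, left, right
  eps   = 1.0d-16
  left  = (1.0d0 + eps) + eps   ! each addition rounds back down to 1.0
  right = 1.0d0 + (eps + eps)   ! 2e-16 now rounds up to the next double
  print *, left == right        ! prints F
  write(*,'(2ES24.17)') left, right
end program reassoc

Without -ffast-math the compiler must respect the parentheses, so both architectures agree on each result; with it, either grouping may be chosen on either architecture.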
Also, if the code uses functions such as sin, cos, exp, atan2 and the like, the output will only be bit-identical if the exact same run-time library is used. This is because these functions are not correctly rounded, and their results typically carry a tiny error (which may be amplified in the calculation and show up in the way you observe it).
It might also be that x86_64 uses special CPU instructions for these functions while ARM uses a software implementation, or vice versa. Note that even when such functions are implemented in the CPU/FPU, they are still not correctly rounded, and very likely different algorithms are used.
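A quick way to check whether the math library is the culprit is to print the bit patterns of a few library calls on both machines and diff the output (a minimal sketch; the input values are arbitrary):

program libm_probe
  implicit none
  real(kind=8) :: x
  x = 0.7d0
  ! These results come from the run-time math library (or CPU instructions)
  ! and are not guaranteed to be correctly rounded, so the bits may differ
  ! between libm versions and architectures.
  write(*,'(A,Z16.16)') 'sin   bits: ', transfer(sin(x), 1_8)
  write(*,'(A,Z16.16)') 'exp   bits: ', transfer(exp(x), 1_8)
  write(*,'(A,Z16.16)') 'atan2 bits: ', transfer(atan2(x, 1.3d0), 1_8)
end program libm_probe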
TL;DR: check the compiler flags for -ffast-math, or try adding -fno-fast-math at the end of the options.
EDIT: As @Rob mentioned in the comments, another flag that could be added is -ffp-contract=off. In gcc it defaults to 'fast' (independently of -ffast-math), which may generate FMA instructions even when not explicitly requested. This also breaks IEEE 754 conformance.
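For example, the flag set from the question with both options appended (in gcc, later options override earlier ones, so adding them at the end is sufficient):
-O2 -ftree-vectorize -funroll-loops -w -ffree-form -ffree-line-length-none -fconvert=big-endian -frecord-marker=4 -fno-fast-math -ffp-contract=off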
Upvotes: 4