I am running the same executable on an HPC cluster with different input arguments. I usually submit several hundred jobs at once (using job arrays or bash loops). Some jobs suddenly crash with a bus error message:
/var/spool/slurmd/job58791836/slurm_script: line 193: 3086318 Bus error (core dumped) ./${exec_name}.o "${@:14}" -L $L -J $J -J0 $J0 -g $g -g0 $g0 -h $h -w $wx -th $thread_num -m 1 -r $r -k $k_sym -p $p_sym -x $x_sym -op $operator -fun $fun -s $site -b 0 -ch $ch -seed $seed -jobid $jobid -q_ipr 2.0 1>&${filename}.log
The submitted jobs require at most ~4GB of memory, but the error persists even though I allocate at least 12GB.
My code is based on the Armadillo C++ library, and I compile it using:
icpx main.cpp XYZ_UI.cpp XYZ_sym.cpp XYZ.cpp -o ${exec_name}.o \
-pthread -lhdf5 -Wall -Wformat=0 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core \
-liomp5 -lpthread -lm -ldl -lmkl_sequential -lstdc++fs -fopenmp -std=c++2a -O3 ${compile_suffix} "${@:1}"
My concern is that most of the nodes on the cluster I use are AMD-based, and the Intel compiler might emit optimised instructions intended for Intel CPUs.
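One way to narrow down this concern would be to log, at the start of each job, which instruction-set extensions the node reports. Below is a minimal sketch, assuming a Linux x86 node where /proc/cpuinfo contains a "flags" line. The helper names (has_cpu_flag, read_cpu_flags_line, report_cpu_flags) and the choice of flags to check are hypothetical and not part of the original code. Note also that executing an instruction the CPU does not support normally raises SIGILL ("Illegal instruction") on Linux rather than SIGBUS, so a match or mismatch here is only a hint.

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Return true if `flag` appears as a whole whitespace-separated token
// in a /proc/cpuinfo "flags" line (so "avx" does not match "avx2").
bool has_cpu_flag(const std::string& flags_line, const std::string& flag) {
    std::istringstream tokens(flags_line);
    std::string token;
    while (tokens >> token) {
        if (token == flag) return true;
    }
    return false;
}

// Read the first line starting with "flags" from /proc/cpuinfo.
// Assumption: Linux on x86; returns an empty string if nothing is found.
std::string read_cpu_flags_line(const std::string& path = "/proc/cpuinfo") {
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("flags", 0) == 0) return line;
    }
    return "";
}

// Print whether the node reports a few example extensions.
// The list below is illustrative only; adjust it to the compile options used.
void report_cpu_flags(std::ostream& out) {
    const char* flags_of_interest[] = {"avx", "avx2", "avx512f"};
    const std::string line = read_cpu_flags_line();
    for (const char* flag : flags_of_interest) {
        out << flag << ": " << (has_cpu_flag(line, flag) ? "yes" : "no") << '\n';
    }
}
```

Calling report_cpu_flags(std::cerr) as the first statement of main and comparing the output of crashed and successful jobs would show whether the failures correlate with a particular node type.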
I used valgrind to check for memory leaks, and it reported only "still reachable" blocks, which should not cause any problems for the code.
Here is the valgrind output:
https://pastebin.com/0mqR6gNx
Is there anything wrong with compiling C++ code with the Intel compiler for AMD CPUs? Is there some other possibility for a bus error to occur besides a CPU mismatch or memory allocation problems? Can the bus error occur because many jobs use the same executable and access the same compiled shared libraries?
I reviewed several forum threads, but none seem to apply to my problem:
What is a bus error? Is it different from a segmentation fault?
https://ask.cyberinfrastructure.org/t/what-does-it-mean-when-i-get-a-bus-error-in-my-job/1101
EDIT: The program fails immediately. It produces no output before the SIGBUS is triggered, even though it prints the input parameters at startup (as you can see in the valgrind output in the pastebin link).
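To get more information than the bare "Bus error" line, the signal's si_code can be reported from a handler. The following is a minimal sketch, assuming a POSIX/Linux system. The names bus_code_name, on_sigbus and install_sigbus_handler are hypothetical and not part of the original code. Because the handler is only active once it has been installed, it cannot report a fault that happens before main starts (for example during static initialisation or library loading).

```cpp
#include <csignal>
#include <cstring>
#include <unistd.h>

// Map the si_code delivered with SIGBUS to its POSIX name.
const char* bus_code_name(int code) {
    switch (code) {
        case BUS_ADRALN: return "BUS_ADRALN (invalid address alignment)";
        case BUS_ADRERR: return "BUS_ADRERR (nonexistent physical address)";
        case BUS_OBJERR: return "BUS_OBJERR (object-specific hardware error)";
        default:         return "unknown si_code";
    }
}

// Write a C string to stderr using write(), which is async-signal-safe.
static void write_stderr(const char* text) {
    ssize_t ignored = write(STDERR_FILENO, text, std::strlen(text));
    (void)ignored;  // nothing useful can be done if the write fails
}

// Signal handler: report the reason code, then terminate.
// Note: _exit() skips the default action, so no core file is written here.
extern "C" void on_sigbus(int, siginfo_t* info, void*) {
    write_stderr("SIGBUS: ");
    write_stderr(bus_code_name(info->si_code));
    write_stderr("\n");
    _exit(128 + SIGBUS);
}

// Install the handler; returns true on success.
bool install_sigbus_handler() {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;  // request the three-argument handler form
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, nullptr) == 0;
}
```

Calling install_sigbus_handler() as the first statement of main would then print the reason code to stderr for any SIGBUS raised after that point.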