Aron

Reputation: 11

mpiexec error on HPC: execvp error on file srun (No such file or directory)

On the HPC cluster, I first tested:

apptainer exec my_container bash -c "activate environment; python3 script.py"

This worked.

It also works when I run the commands interactively in a terminal:

apptainer shell my_container
activate environment
mpiexec -n 5 python3 script.py

But when I put the same command in a Slurm script:

apptainer exec my_container bash -c "activate environment;mpiexec -n 5 python3 script.py"

I get the following error:

[mpiexec@acn89] HYDU_create_process (lib/utils/launch.c:73): execvp error on file srun (No such file or directory)

Any help is greatly appreciated.

Upvotes: 1

Views: 357

Answers (1)

A user of the HPC cluster I manage recently reported this same error to me, so hopefully this can still help you.

In our case, the main issue was that the host environment leaks into the container, so apptainer ends up mixing host and container binaries and libraries. If you look at the apptainer exec documentation:

-e, --cleanenv                      clean environment before running container

This flag cleans the environment inside the container, so most of the variables set in my shell are not passed down, including some paths pointing to an MPI installation different from the one in the container, which seems to be the issue. This is what I got, first without and then with -e:

[renato@n05 ~]$ apptainer exec spyro-1_latest.sif bash
Apptainer> . /home/firedrake/firedrake/bin/activate
(firedrake) Apptainer> mpiexec -n 6 python3 -c 'print(1)'
[mpiexec@n05] HYDU_create_process (lib/utils/launch.c:73): execvp error on file srun (No such file or directory)
^C[mpiexec@n05] Sending Ctrl-C to processes as requested
[mpiexec@n05] Press Ctrl-C again to force abort
[mpiexec@n05] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)
[mpiexec@n05] send_hdr_downstream (mpiexec/pmiserv_cb.c:28): sock write error
[mpiexec@n05] HYD_pmcd_pmiserv_send_signal (mpiexec/pmiserv_cb.c:218): unable to write data to proxy
[mpiexec@n05] ui_cmd_cb (mpiexec/pmiserv_pmci.c:61): unable to send signal downstream
[mpiexec@n05] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@n05] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:173): error waiting for event
[mpiexec@n05] main (mpiexec/mpiexec.c:260): process manager error waiting for completion
(firedrake) Apptainer> exit
exit
[renato@n05 ~]$ apptainer exec -e spyro/spyro-1_latest.sif bash
Apptainer> . /home/firedrake/firedrake/bin/activate
(firedrake) Apptainer> mpiexec -n 6 python3 -c 'print(1)'
1
1
1
1
1
1
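
To see which host variables actually reach the container, it can help to compare the environment with and without -e. The helper below is only a sketch: the function name filter_sched_vars is made up for illustration, and the prefix list (SLURM_, PMI_, HYDRA_, I_MPI_) is an assumption about which variables matter. One plausible reading of the transcript, also an assumption, is that mpiexec inside the container sees scheduler variables from the host and tries to launch through srun, which is not installed in the image.

```shell
# Hypothetical helper: read `env` output on stdin and print the names of
# scheduler/MPI-related variables, one per line, sorted.
# The prefix list is an assumption; extend it for your site.
filter_sched_vars() {
    cut -d= -f1 | grep -E '^(SLURM_|PMI_|HYDRA_|I_MPI_)' | sort
}
```

Inside a job, run "apptainer exec my_container env | filter_sched_vars" and then "apptainer exec -e my_container env | filter_sched_vars". If the first prints scheduler variables and the second prints few or none, the host environment is leaking into the container in the first case.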

apptainer shell also has the -e flag.
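
Applied to the batch script from the question, the same fix could look roughly like the sketch below. The job name and the run_clean wrapper are hypothetical, the #SBATCH values are placeholders, and the sketch assumes all ranks fit on a single node, since an mpiexec started inside a cleaned container environment has no information about other nodes in the allocation.

```shell
#!/bin/bash
#SBATCH --job-name=mpi-in-container   # hypothetical job name
#SBATCH --nodes=1                     # single node assumed
#SBATCH --ntasks=5

# Hypothetical wrapper: run a command line inside a container image with a
# cleaned environment (-e), so host-side variables do not reach mpiexec.
#   $1 = container image, $2 = command line passed to bash -c
run_clean() {
    apptainer exec -e "$1" bash -c "$2"
}
```

With the names from the question, the last line of the script would then be: run_clean my_container "activate environment; mpiexec -n 5 python3 script.py"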

Upvotes: 0
