I am trying to run a PyTorch application that uses DDP with MPI as the communication backend. I am running it on a cluster where every node has two network interfaces: a fast Ethernet interface and an InfiniBand interface.
When running with srun, how can I specify the network interface to be used? I saw that when launching with plain mpirun I can add "-iface ib0" to the command, but how do I achieve the same thing when working with Slurm?
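For reference, this is roughly what I mean by the mpirun approach (this is Intel MPI's Hydra launcher; the rank count and script path are just from my setup):

```shell
# Without Slurm I can pin the launcher and traffic to the InfiniBand
# interface directly on the mpirun command line:
mpirun -n 16 -iface ib0 python $HOME/torch_projects/resnet50_cifar100.py --epochs 200
```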
I have attached below a sample script that I want to use. Can someone verify if it is the right thing to do?
#!/bin/bash
#SBATCH --job-name=resnet50_cifar100_job
#SBATCH --output=resnet50_cifar100_output_opx_%j.txt
#SBATCH --error=resnet50_cifar100_error_opx_%j.txt
#SBATCH --ntasks=16
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
# Source the environment setup script
source $HOME/activate_environment.sh
# Activate the Python virtual environment
source $HOME/torch_mpi_env/bin/activate
#export FI_TCP_IFACE=ib0
#export FI_PROVIDER=psm2
#export I_MPI_FABRICS=ofi
#export I_MPI_FALLBACK=0
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:ofi
export I_MPI_OFI_PROVIDER=psm2
# mpiP writes its report into this directory; create it up front so the
# "-f" output path is valid
mkdir -p ./mpip_results
export MPIP="-f ./mpip_results"
export SLURM_NETWORK=ib0
# Run the Python script
srun --mpi=pmi2 --network=ib0 \
--export=ALL,LD_PRELOAD=$HOME/mpiP_build/lib/libmpiP.so \
python $HOME/torch_projects/resnet50_cifar100.py --epochs 200
# Deactivate the virtual environment
deactivate
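As an alternative to srun's --network option, I was also wondering whether pinning the interface purely through environment variables would work. This is an untested sketch based on the Intel MPI and libfabric documentation; the provider names depend on what my fabric actually supports:

```shell
# Untested alternative: select the interface via Intel MPI / libfabric
# environment variables instead of srun --network.
export I_MPI_HYDRA_IFACE=ib0   # interface the Hydra launcher uses between nodes
export FI_PROVIDER=verbs       # libfabric: route MPI traffic over InfiniBand verbs

# Or, if the traffic has to stay on TCP, restrict the tcp provider to ib0:
# export FI_PROVIDER=tcp
# export FI_TCP_IFACE=ib0

srun --mpi=pmi2 python $HOME/torch_projects/resnet50_cifar100.py --epochs 200
```

Would this achieve the same effect as "-iface ib0" under mpirun?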