Jonathan Ato Markin

Reputation: 61

Specifying Network Interface When Using Slurm and Intel MPI

I am trying to run a PyTorch application that uses DDP with MPI as the communication backend. I am running it on a cluster where every node has two network interfaces: a fast Ethernet interface and an InfiniBand interface.

When running with srun, how can I specify the network interface to be used? I saw that when using plain MPI I can add "-iface ib0" to my mpirun command, but how do I achieve the same thing when working with Slurm?
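
For context, this is roughly what I would do with a plain mpirun launch, and my best guess at an srun equivalent using environment variables (the FI_TCP_IFACE line just mirrors the commented-out block in my script below, so treat it as an unverified guess):

# Plain Intel MPI: pick the InfiniBand interface on the command line
mpirun -n 16 -iface ib0 python resnet50_cifar100.py --epochs 200

# My guess for srun: export the interface choice before launching (unverified)
export FI_TCP_IFACE=ib0
srun --mpi=pmi2 -n 16 python resnet50_cifar100.py --epochs 200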

I have attached below a sample script that I want to use. Can someone verify whether this is the right approach?

#!/bin/bash
#SBATCH --job-name=resnet50_cifar100_job
#SBATCH --output=resnet50_cifar100_output_opx_%j.txt
#SBATCH --error=resnet50_cifar100_error_opx_%j.txt
#SBATCH --ntasks=16                
#SBATCH --nodes=4             
#SBATCH --ntasks-per-node=4        

# Source the environment setup script
source $HOME/activate_environment.sh

# Activate the Python virtual environment
source $HOME/torch_mpi_env/bin/activate


#export FI_TCP_IFACE=ib0
#export FI_PROVIDER=psm2
#export I_MPI_FABRICS=ofi
#export I_MPI_FALLBACK=0
 
# Intel MPI settings: verbose debug output, shared memory within a node plus
# OFI (libfabric) across nodes, and PSM2 as the OFI provider
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:ofi
export I_MPI_OFI_PROVIDER=psm2

# mpiP profiler options: write reports to ./mpip_results
export MPIP="-f ./mpip_results"

# Trying to steer Slurm to the InfiniBand interface (equivalent of --network)
export SLURM_NETWORK=ib0

# Run the Python script
srun --mpi=pmi2 --network=ib0 \
     --export=ALL,LD_PRELOAD=$HOME/mpiP_build/lib/libmpiP.so \
     python $HOME/torch_projects/resnet50_cifar100.py --epochs 200


# Deactivate the virtual environment
deactivate

Upvotes: 1

Views: 188

Answers (0)
