Reputation: 1
I have been struggling to get multiple instances of a python script to run on SLURM. On my login node I have installed python3.6, and I have a python script "my_script.py" which takes a text file as input to read in run parameters. I can run this script on the login node using
python3.6 my_script.py input1.txt
Furthermore, I can submit a script submit.sh to run the job:
#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=output1.txt
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G
python3.6 my_script.py input1.txt
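which I submit with:
sbatch submit.sh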
This runs fine and executes as expected. However, if I submit the following script:
#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=output2.txt
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G
python3.6 my_script.py input2.txt
while the first job is still running, I get the following error message in output2.txt:
/var/spool/slurmd/job00130/slurm_script: line 9: python3.6: command not found
I found that I have this same issue when I try to submit a job as an array. For example, when I submit the following with sbatch:
#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=out_%j.txt
#SBATCH --array=1-10
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G
echo PWD $PWD
cd $SLURM_SUBMIT_DIR
python3.6 my_script.py input_$SLURM_ARRAY_TASK_ID.txt
I find that only out_1.txt shows that the job ran. All of the output files for tasks 2-10 show the same error message:
/var/spool/slurmd/job00130/slurm_script: line 9: python3.6: command not found
I am running all of these scripts on an HPC cluster that I created using the Compute Engine API on Google Cloud Platform. I used the following tutorial to set up the SLURM cluster:
https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp/#0
Why is SLURM unable to run multiple python3.6 jobs at the same time, and how can I get my array submission to work? I have spent days going through SLURM FAQs and other Stack Overflow questions, but I have not found a way to resolve this or a suitable explanation of what's causing it in the first place.
Thank you
Upvotes: 0
Views: 1136
Reputation: 1
I found out what I was doing wrong. I had created a cluster with two compute nodes, compute1 and compute2. At some point, while trying to get things to work, I had submitted a job to compute1 containing the following commands:
# Install Python 3.6
sudo yum -y install python36
# Install python-setuptools which will bring in easy_install
sudo yum -y install python36-setuptools
# Install pip using easy_install
sudo easy_install-3.6 pip
from the following post:
How do I install python 3 on google cloud console?
This had installed python3.6 on compute1, which is why my jobs would run on compute1. However, I didn't think that script had run successfully, so I never submitted it to compute2, and therefore jobs sent to compute2 could not find python3.6. For some reason I thought Slurm was using the python3.6 from the login node, since I had sourced a path to it in my sbatch submission.
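The fix was simply to run the same install on compute2 as well. One way to do that (a sketch; --nodelist pins the job to the named node, and it assumes your account can sudo on the compute nodes, as in the commands above) is:
# Run the same install commands as a job pinned to compute2
# (--nodelist/-w restricts the job to that node)
sbatch --nodelist=compute2 --wrap="sudo yum -y install python36 python36-setuptools && sudo easy_install-3.6 pip"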
After installing python3.6 on compute2, I was also able to import all of my locally installed python libraries, following the link below, by including
import sys
import os
sys.path.append(os.getcwd())
at the beginning of my python script.
How to import a local python module when using the sbatch command in SLURM
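As a sanity check before resubmitting the array, you can confirm that every node actually resolves the interpreter, for example (compute1 and compute2 being my node names):
# Should print the python3.6 path on each node;
# "command not found" means it is still missing there
srun -w compute1 which python3.6
srun -w compute2 which python3.6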
Upvotes: 0