Reputation: 587
I have a Python script that processes approximately 10,000 FITS files one by one. For each file, the script generates an output in the same directory as the input files and creates a single CSV file to record statistics about the processed files.
Previously, I parallelized the script using async with multiprocessing pools, but now I have access to a SLURM cluster and would like to run it using SLURM.
What is the simplest way to achieve this? All the files are stored in the same directory, and there’s no specific order in which they need to be processed. EDIT: I also need to activate a conda environment before running the Python script. The Python script should accept a filename as an argument and then run; I usually pass the filename via args. Thanks
EDIT update:
I managed to make it work.
First, I created a bash script for submitting the jobs:
#!/bin/bash
# Define the directory containing FITS files
INPUT_DIR="input_dir"
LOG_DIR="${INPUT_DIR}/logs"
# Ensure the logs directory exists
mkdir -p "$LOG_DIR"
# List all FITS files and write their paths to a temporary file
find "$INPUT_DIR" -name "*.fits" > file_list.txt
# Loop through each FITS file and submit a SLURM job
while IFS= read -r filepath; do
    sbatch run2.sh "$filepath"
done < file_list.txt
That script calls the run2.sh script, which contains the following:
#!/bin/bash
#SBATCH -p long
#SBATCH -J test
#SBATCH -n 1
#SBATCH -t 00:05:00
# Use the job ID (%j) so each job writes its own log files
#SBATCH --output=file_%j.out
#SBATCH --error=file_%j.err
source miniconda3/bin/activate my_env
# Define variables
# EVENT_PATH="directory_path"
# Run Python script
python3 -u my_python_code.py "$1" "False" 3
My next concern is that this approach creates 10,000 jobs, one per image, even though analysing each image only takes a few seconds. Maybe there is a smarter way to do it.
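One alternative I am considering (just a sketch, not tested) is to request a single multi-core job and fan the files out with xargs, so the whole run is one submission instead of 10,000. The script name run_all.sh, the core count, and the time limit below are guesses; file_list.txt, my_env, and the my_python_code.py arguments are the same as above:
#!/bin/bash
#SBATCH -p long
#SBATCH -J fits_all
#SBATCH -n 1
#SBATCH -c 16
#SBATCH -t 12:00:00
#SBATCH --output=fits_all.out
#SBATCH --error=fits_all.err

source miniconda3/bin/activate my_env

# Run up to $SLURM_CPUS_PER_TASK copies of the Python script at once,
# each invocation getting a single path from file_list.txt
xargs -a file_list.txt -P "$SLURM_CPUS_PER_TASK" -I {} \
    python3 -u my_python_code.py {} "False" 3
Submitted once with sbatch run_all.sh, this keeps the queue to a single job; the trade-off is that all files share one allocation, and the CSV writes still need to be safe for concurrent runs, just as with the one-job-per-file version.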
Upvotes: 3
Views: 39
Reputation: 349
I had a similar requirement some time ago and below is the script I used to solve it. What you need are SLURM array jobs, where each job will get its own set of resources and can run on a different file.
Below, I used the $SLURM_ARRAY_TASK_ID environment variable as a Python argument (sys.argv[2]) to decide which file to operate on. It is essentially the index of the job within the job array, as defined in the docs linked above. The %a in the --output filename is replaced by this index, so each task gets its own output file. You can pass additional parameters to the SLURM script, and then on to the Python script: arg1 -> $1 -> sys.argv[1].
Of course, your core/memory/time requirements will be different.
#!/bin/bash
# use as:
# sbatch --job-name=name_%a --output=out_%a.txt --array=1-nFiles testslurm.sh arg1
#-------------------------------------------------------------
#-------------------------------------------------------------
#
#
#Number of CPU cores to use within one node
#SBATCH -c 12
#
#Define the number of hours the job should run.
#Maximum runtime is limited to 10 days, i.e. 240 hours
#SBATCH --time=24:00:00
#
#Define the amount of RAM used by your job in GigaBytes
#In shared memory applications this is shared among multiple CPUs
#SBATCH --mem=64G
#Do not export the local environment to the compute nodes
#unset SLURM_EXPORT_ENV
#
#Set the number of threads to the SLURM internal variable
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
#
#load the respective software module you intend to use
#module load YourModuleHere
#
#run the respective binary through SLURM's srun
conda init bash
conda activate suite2p
srun --cpu_bind=verbose python batchfunc.py ~/codes/data/$1 $SLURM_ARRAY_TASK_ID
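To apply this directly to the file list from the question, the final srun line above can be replaced by a lookup of the N-th path in file_list.txt, where N is the array index. A minimal, untested sketch (file_list.txt and my_python_code.py come from the question, everything else stays as in the script above):
# Pick this task's file: line number SLURM_ARRAY_TASK_ID of file_list.txt
FITS_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" file_list.txt)
srun python3 -u my_python_code.py "$FITS_FILE" "False" 3
Submitted as sbatch --array=1-10000%50 jobscript.sh (jobscript.sh being whatever the batch file is named), this covers all 10,000 files while running at most 50 tasks at a time. Note that many clusters cap the maximum array size (MaxArraySize), in which case the range has to be split or each task given a chunk of lines.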
Upvotes: 2