Reputation: 315
I run a .sh file with srun because I want to see the script's output live. But when I run srun job_spinup.sh southfr_exp 1 & the job always fails with a time-limit error after two main loops. I want to run a 12-month model and loop it 20 times (a 20-cycle spin-up), but the error occurs in November of the second loop.
Here is the main code in job_spinup.sh:
#!/bin/bash
#SBATCH -J spinup
#SBATCH -p knl_cache
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 10:00:00
#SBATCH -o spinup.log
#SBATCH -e spinup.log
#=========================================================================
# USAGE
# nohup ./job_spinup.sh DOM[:EXP:VERSION] nodes[:tasks_per_node:tasks_for_trip N START_ID:START_MM] &
#
# by default: EXP=spinup, N=20, START_ID=0, START_MM=1
#=========================================================================
#set -x
#
if [ $# -lt 2 ]; then
echo "Usage: $0 DOM[:EXP:VERSION] nodes[:tasks_per_node:tasks_for_trip N START_ID:START_MM]"
echo "DOM = the name of a domain"
echo "EXP = the name of an experiment"
echo "N = the number of runnings"
echo "START_ID = start id of a running"
echo "START_MM = start month of a running"
exit
fi
DOM=`echo $1 | awk '{split($1, f, ":"); print f[1]}'`
EXP=`echo $1 | awk '{split($1, f, ":"); print f[2]}'`
EXP=${EXP:-spinup}
VERSION=`echo $1 | awk '{split($1, f, ":"); print f[3]}'`
VERSION=${VERSION:--X0}
num_nodes=`echo ${2} | awk '{split($1, f, ":"); print f[1]}'`
tasks_per_node=`echo ${2} | awk '{split($1, f, ":"); print f[2]}'`
tasks_per_node=${tasks_per_node:-40}
tasks_for_trip=`echo ${2} | awk '{split($1, f, ":"); print f[3]}'`
tasks_for_trip=${tasks_for_trip:-1}
SPINUP_N=${3:-20}
START_ID=`echo $4 | awk '{split($1, f, ":"); print f[1]}'`
START_ID=${START_ID:-0}
START_MM=`echo $4 | awk '{split($1, f, ":"); print f[2]}'`
START_MM=${START_MM:-1}
# source ~/anaconda3/etc/profile.d/conda.sh
source $(conda info --base)/etc/profile.d/conda.sh
conda activate myenv
echo "***************************************"
echo " CONDA ENV ACTIVATED FOR NCO COMMAND"
echo "***************************************"
echo $SPINUP_N
#
# check if TRIP is used
LTRIP=`grep "LOASIS *= *T" OPTIONS/OPTIONS.nam | wc -l`
#
ulimit -s unlimited
ulimit -n 500000
ulimit -u 64000
unset I_MPI_PMI_LIBRARY
export OMP_NUM_THREADS=1
export DR_HOOK=0
export DR_HOOK_OPT=prof
...
YYYY=${YYYYMMDDHH::4}
MM=${YYYYMMDDHH:4:2}
j=$START_ID
while [ $j -lt $SPINUP_N ] ; do
echo " "
echo "------------------"
echo "SPINUP : $j / $SPINUP_N"
while [ $MM -le 12 ] ; do
if [ $LTRIP -eq 1 ]; then
mpirun -np $((SLURM_NTASKS - tasks_for_trip)) offline.exe : -np $tasks_for_trip trip.exe &> offline
else
#echo ${SLURM_NTASKS}
#mpirun -np ${SLURM_NTASKS} offline.exe &> offline
#srun -n 1 offline.exe &> offline
offline.exe &> offline
fi
....
# Change dates to start again
if [ $MM -eq 12 ]; then
ncap2 -O -s "'DTCUR-YEAR'=$YYYY;'DTCUR-MONTH'=1;'DTCUR-DAY'=1;'DTCUR-TIME'=0" PREP.nc PREP.nc
[ $LTRIP -eq 1 ] && ncap2 -O -s "date(:)={$YYYY,1,1,0}" TRIP_PREP.nc TRIP_PREP.nc
fi
...
done
echo '------------------'
echo ' '
MM=01
j=$(( j+1 ))
done
...
# end simulation
date >> date_$EXP
echo "***************************************"
echo " SPINUP ENDS CORRECTLY"
echo "***************************************"
conda deactivate
echo "***************************************"
echo " CONDA ENV DEACTIVATED"
echo "***************************************"
The output looks like this:
(base) [xushan@int2 southfr_exp]$ srun job_spinup.sh southfr_exp 1 &
[1] 11570
(base) [xushan@int2 southfr_exp]$ srun: job 8860513 queued and waiting for resources
srun: job 8860513 has been allocated resources
***************************************
CONDA ENV ACTIVATED FOR NCO COMMAND
***************************************
20
./job_spinup.sh: line 62: ulimit: open files: cannot modify limit: Operation not permitted
***************************************
READY TO START SPINUP on tcn991.bullx
spinup 20 0:1
***************************************
------------------
SPINUP : 0 / 20
199601
1
199602
1
199603
1
199604
1
199605
1
199606
1
199607
1
199608
1
199609
1
199610
1
199611
1
199612
1
------------------
------------------
SPINUP : 1 / 20
199601
1
199602
1
199603
1
199604
1
199605
1
199606
1
199607
1
199608
1
199609
1
199610
1
srun: Force Terminated job 8860513
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8860513.0 ON tcn991 CANCELLED AT 2020-09-07T12:51:24 DUE TO TIME LIMIT ***
srun: error: tcn991: task 0: Terminated
srun: Terminating job step 8860513.0
Can anyone help? Thanks a lot! I am a beginner with Slurm. Is it because I activated a conda environment? With squeue I can see that the job lasts only about 5 minutes, and I have no idea why. Is it because of offline.exe?
Upvotes: 0
Views: 7576
Reputation: 1685
srun does not read job scripts like sbatch does. This means that all your #SBATCH options are ignored, including the time limit you set for the job. Your job therefore goes to the default partition with the default time limit, which only seems to be enough time for two loops.
There are multiple ways to solve it:
- Submit with sbatch and take a look at your output file (tail -f spinup.log)
- Submit with sbatch and attach to the job with sattach
- Pass the #SBATCH options as parameters to srun
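As a rough sketch of the last option, the helper below pulls the short-form #SBATCH options out of a job script so they can be handed to srun on the command line. The name sbatch_opts is hypothetical and not part of Slurm. It skips -o and -e so the output keeps printing to the terminal, and it assumes one option per #SBATCH line, as in the script from the question.

```shell
#!/bin/bash
# sbatch_opts (hypothetical helper, not a Slurm command):
# print the options from the "#SBATCH ..." lines of a job script on one line.
# Lines starting with "#SBATCH -o" or "#SBATCH -e" are dropped so that srun
# keeps writing to the terminal instead of a log file.
# Assumption: short options only, one per "#SBATCH" line.
sbatch_opts() {
    sed -n \
        -e '/^#SBATCH[[:space:]]\{1,\}-[oe][[:space:]]/d' \
        -e 's/^#SBATCH[[:space:]]\{1,\}//p' "$1" |
        paste -s -d ' ' -
}

# Print (rather than run) the resulting command so it can be checked first.
if [ -f job_spinup.sh ]; then
    echo "srun $(sbatch_opts job_spinup.sh) job_spinup.sh southfr_exp 1"
fi

# For the script in the question this amounts to typing the options by hand:
#   srun -J spinup -p knl_cache -N 1 -n 1 -t 10:00:00 job_spinup.sh southfr_exp 1
#
# The sbatch route needs no helper, since sbatch reads the #SBATCH lines itself:
#   sbatch job_spinup.sh southfr_exp 1
#   tail -f spinup.log
```

Typing the options directly after srun is just as valid. The helper only avoids keeping the same settings in two places.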
Upvotes: 2