irritable_phd_syndrome

Reputation: 5067

Modifying files via slurm epilog script is not effective

I'm on CentOS 6.9 running Slurm 17.11.7. I've modified my /gpfs0/export/slurm/conf/epilog script. Ultimately, I would like to print job resource utilization information to the stdout file used by each user's job.

I've been testing it within a conditional at the end of the script, restricted to my own user, before I roll it out to other users. Below is my modified epilog script:

#!/bin/bash
# Clear out TMPDIR on the shared file system after the job completes
exec >> /var/log/epilog.log
exec 2>> /var/log/epilog.log

if [ -z "$SLURM_JOB_ID" ]
then
        echo "This script should be executed from slurm."
        exit 1
fi

TMPDIR="/gpfs0/scratch/${SLURM_JOB_ID}"

rm -rf "$TMPDIR"

### My additions to the existing script ###
if [ "$USER" == "myuserid" ]
then
    # Path to the job's stdout file, parsed from scontrol's key=value output
    STDOUT=`scontrol show jobid ${SLURM_JOB_ID} | grep StdOut | awk 'BEGIN{FS="="}{print $2}'`
    # Regular stdout/stderr is not respected in the epilog, so append via python.
    python -c "import sys; stdout=sys.argv[1]; f=open(stdout, 'a'); f.write('sticks\n'); f.close();" ${STDOUT}
fi
exit 0

From the Prolog and Epilog section of the slurm.conf manual, it seems that stdout/stderr are not respected in the epilog. Hence I modify the stdout file directly with python.
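For reference, this is how the pipeline pulls the StdOut path out of scontrol's key=value output (the sample line and path below are made up for the demonstration; a real `scontrol show jobid` prints many such fields):

```shell
# Hypothetical single line from `scontrol show jobid <id>` output
line="   StdOut=/home/myuserid/slurm-12345.out"
# Split on "=" and keep the right-hand side, as in the epilog script
STDOUT=$(echo "$line" | grep StdOut | awk 'BEGIN{FS="="}{print $2}')
echo "$STDOUT"
```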

I've picked the compute node node21 to run this job, so I logged into node21 and tried several things to get slurmd to pick up my changes to the epilog script.

Reconfiguring slurmd:

sudo scontrol reconfigure

Restarting the slurm daemon:

sudo service slurm stop
sudo service slurm start

Neither seems to pick up the changes to the epilog script when I submit jobs. When I put the same conditional in a batch script, it runs flawlessly:

#!/bin/bash
#SBATCH --nodelist=node21
echo "Hello you!"
echo $HOSTNAME

if [ "$USER" == "myuserid" ]
then
    STDOUT=`scontrol show jobid ${SLURM_JOB_ID} | grep StdOut | awk 'BEGIN{FS="="}{print $2}'`
    python -c "import sys; stdout=sys.argv[1]; f=open(stdout, 'a'); f.write('sticks\n'); f.close();"  ${STDOUT}
    #echo "HELLO! ${USER}"
fi

QUESTION : Where am I going wrong?

EDIT : This is an MWE in the context of trying to print job resource utilization at the end of the output file.

Upvotes: 0

Views: 1812

Answers (2)

TexasDex

Reputation: 21

According to this page, you can print to stdout from the Slurm prolog by prefacing your output with the 'print' command.

For example, instead of

echo "Starting prolog"

You need to do

echo "print Starting Prolog"

Unfortunately this only seems to work for the prolog, not the epilog.

Upvotes: 0

irritable_phd_syndrome

Reputation: 5067

To get this working, append the following to the end of the epilog script:

# writing job statistics into job output
OUT=`scontrol show jobid ${SLURM_JOB_ID} | grep StdOut | awk 'BEGIN{FS="="}{print $2}'`
echo -e "sticks" >> ${OUT} 2>&1

There was no need to restart the slurm daemons. Additional commands can be added to obtain resource utilization, e.g.

sleep 5s   ### Give the job a chance to be written to the slurm database for job statistics.
sacct --units M --format=jobid,user%5,state%7,CPUTime,ExitCode%4,MaxRSS,NodeList,Partition,ReqTRES%25,Submit,Start,End,Elapsed -j $SLURM_JOBID >> $OUT 2>&1

Basically, you can still append to the output file using >>. It had not occurred to me that regular output redirection still works in the epilog. It remains unclear why the python statement did not do the same.
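As a sanity check outside of Slurm, both the >> redirection and the python one-liner append to a regular file just fine (the mktemp path is illustrative; python3 is used here in place of the python on the CentOS 6 node):

```shell
# Append to a throwaway file with both methods used in the epilog
OUT=$(mktemp)
echo -e "sticks" >> "${OUT}" 2>&1
python3 -c "import sys; f=open(sys.argv[1], 'a'); f.write('sticks\n'); f.close()" "${OUT}"
contents=$(cat "${OUT}")
rm -f "${OUT}"
echo "$contents"
```

So the failure is specific to the epilog environment, not to the append mechanism itself.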

Upvotes: 1
