Reputation: 3621
I am trying to submit a script to Slurm that runs m4 on an input file. m4 is installed on our cluster, and if I run the script by itself everything works as expected, but when I submit it to Slurm via a batch script, I get an error.
Here is the script I would like to run (named m4it.sh).
[Note that I'm printing PATH and SHELL in an attempt to debug.]
#!/usr/bin/env bash
echo "Beginning m4it.sh"
echo "PATH=$PATH"
echo "SHELL=$SHELL"
echo
m4 file.m4 > fileout.txt
and here is my Slurm batch script (m4it.slurm):
#!/usr/bin/env bash
#
#SBATCH --job-name=m4it
### Account name (req'd)
#SBATCH --account=MyAccount
### Redirect .o and .e files to the logs dir
#SBATCH -o m4it.out
#SBATCH -e m4it.err
#
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH --mem-per-cpu=125
echo "PATH=$PATH"
echo "SHELL=$SHELL"
echo
echo "running m4it.sh"
echo
./m4it.sh
which submits successfully to Slurm via
sbatch m4it.slurm
When it executes, I get the following error in my m4it.err logfile:
./m4it.sh: line 8: m4: command not found
The PATH and SHELL variables (printed to m4it.out by both the m4it.slurm and m4it.sh scripts) are identical: PATH is the same one I have when I log in, and SHELL is /bin/bash, as expected.
Even if I add a symlink to the m4 executable in a directory on my PATH, I still get this error. And it is not just m4: the script also reports "apropos" as an unknown command, even though it runs fine on the command line. The script can "cd" and "ls" just fine, though.
I've checked read/write/execute permissions.
ls -ld / /usr /usr/bin /usr/bin/m4
yields the following:
dr-xr-xr-x. 30 root root 4096 Apr 8 11:11 /
drwxr-xr-x. 14 root root 4096 Feb 17 20:24 /usr
dr-xr-xr-x. 2 root root 36864 Apr 29 11:14 /usr/bin
-rwxr-xr-x 1 root root 212440 Jun 3 2010 /usr/bin/m4
It seems that the node on which m4it.sh executes is different from the front node, and that somehow this information (environment variables or paths) is not coming across. I have also tried to export all of my settings with the --export=ALL argument as follows:
sbatch m4it.slurm --export=ALL
but this didn't work either (same result). Can anyone help here?
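(A side note on the command above: sbatch expects its options before the script name, and anything placed after the script is passed to the script itself rather than to sbatch, so the flag would presumably need to come first, as below; also, as far as I understand, --export=ALL is already the default for sbatch, so by itself it would not change anything.)
sbatch --export=ALL m4it.slurm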
Upvotes: 3
Views: 23543
Reputation: 3621
I was able to log in to the compute node in an interactive session. Indeed, that node's /usr/bin is significantly different from the front node's, and m4 is not installed there.
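For anyone wanting to check the same thing on their own cluster, here is a rough sketch of how to get such an interactive shell and look for the binary (the exact srun flags vary by cluster; the account matches the one in the batch script above):
# Request an interactive shell on a compute node
srun --account=MyAccount --ntasks=1 --time=00:05:00 --pty bash -i
# Then, on the compute node, check whether and where m4 resolves
type -a m4
ls -l /usr/bin/m4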
This also explains why the symlink from a directory in my PATH no longer worked: it pointed to /usr/bin/m4, but on that compute node /usr/bin/m4 does not exist, so the symlink was broken.
If I want to use m4, the solution is either to ask the admins to install it on the compute nodes or to copy a local version of the executable into a directory in my home that is on my PATH.
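A rough sketch of the second option (the directory name is just an example, and it assumes the compute nodes can run a binary copied from the front node, i.e. compatible libraries and architecture):
# On the front node: put a copy of m4 somewhere under $HOME
mkdir -p ~/bin
cp /usr/bin/m4 ~/bin/
# In m4it.slurm, before calling ./m4it.sh, make sure that directory is on PATH
export PATH="$HOME/bin:$PATH"
./m4it.sh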
Upvotes: 3