user3200387

Reputation: 151

Why do I keep getting NonZeroExitCode when using sbatch SLURM?

I have a simple test.ksh that I am running with the command:

sbatch test.ksh

I keep getting "JobState=FAILED Reason=NonZeroExitCode" (using "scontrol show job")

I have already made sure of the following:

  1. slurmd and slurmctld are up and running correctly
  2. permissions on "test.ksh" are 777.
  3. The command "srun test.ksh" (by itself, without using sbatch) succeeds without problems
  4. I tried putting in a "return 0" in the last line of "test.ksh" without luck
  5. I tried putting in a "exit 0" in the last line of "test.ksh" without luck
  6. I tried putting in "hostname" in the last line of "test.ksh" without luck
  7. I tried putting in "srun hostname" in the last line of "test.ksh" without luck

Upvotes: 6

Views: 8601

Answers (3)

Vinayak

Reputation: 11

Sometimes the issue is due to missing folders.

You can check where the job's output files are written by running scontrol show job <jobid> and looking at the StdOut and StdErr fields.

In my case the slurm folder was missing.

Resolve it by creating the missing folder(s).
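
As a rough sketch of that check, the helper below reads scontrol show job output on stdin, pulls out the StdOut and StdErr paths, and reports whether each parent directory exists. The function name check_job_log_dirs is made up for illustration, and it assumes each field sits on its own line in the form StdOut=/some/path.

```shell
# Hypothetical helper: read `scontrol show job` output on stdin and report
# whether the parent directory of each StdOut/StdErr path exists.
check_job_log_dirs() {
    # Keep only the path part of lines that look like "   StdOut=/some/path".
    sed -n -E 's/^[[:space:]]*(StdOut|StdErr)=//p' | while IFS= read -r path; do
        dir=$(dirname "$path")
        if [ -d "$dir" ]; then
            echo "ok: $dir"
        else
            echo "missing: $dir"
        fi
    done
}
```

Something like scontrol show job <jobid> | check_job_log_dirs would then list the directories to create; run it on a node that sees the same filesystem as the job.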

Upvotes: 0

stats con chris

Reputation: 347

In my case, the folder was owned by root while I was running jobs as a different user. I had mistakenly created the folder as root inside that user's home directory. Running chown user:usergroup foldername fixed the problem.

Upvotes: 0

user3200387

Reputation: 151

I found out that I hadn't set --error and --output, so the job's output files went to the default location: the directory from which I issued the sbatch command.

The problem was that I didn't have sufficient privileges to write to that directory.

The solution was to point --error and --output at paths in a directory where I did have write privileges.
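
A minimal sketch of how to verify this before submitting, assuming a POSIX shell. The function name check_log_dir and the slurm-logs directory used below are placeholders for illustration, not anything Slurm provides.

```shell
# Hypothetical helper: create DIR if needed and succeed only if it is a
# directory the current user can write to.
check_log_dir() {
    dir=$1
    mkdir -p "$dir" 2>/dev/null
    [ -d "$dir" ] && [ -w "$dir" ]
}
```

If check_log_dir "$HOME/slurm-logs" succeeds, the job can be submitted with sbatch --output="$HOME/slurm-logs/%j.out" --error="$HOME/slurm-logs/%j.err" test.ksh, where %j is replaced by the job ID. Note that the check runs on the submitting machine, so it only helps if the compute nodes see the same filesystem.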

Upvotes: 7
