Reputation: 656
I thought MaxRSS was used to get an understanding of the memory requirements of SLURM jobs; however, now I'm questioning myself.
I received a notification from SLURM that my job had failed.
SLURM Job_id=7347729 Name=job.cph.proband Ended, Run time 00:01:21, OUT_OF_MEMORY
I used sacct to check why the job failed, and it looks like it failed with an OOM error. This is odd, as it looks like it only tried to use 1.61 GB of the requested 3 GB (shown as 2.93 here).
Either my understanding of MaxRSS is wrong or this job is failing for another reason?
Upvotes: 6
Views: 9724
Reputation: 96976
This wiki post suggests that the job manager may not collect usage data often enough to catch a spike in memory usage, which would prevent the sacct tool from giving you a specific answer:
SLURM's accounting mechanism is polling based and doesn't always catch spikes in memory usage. FSL's implementation uses a Linux kernel feature called "cgroups" to control memory and CPU usage. SLURM sets up a cgroup for the job with the appropriate limits which the Linux kernel strictly enforces.
The problem is simple: the kernel killed a process from the offending job and the SLURM accounting mechanism didn't poll at the right time to see the spike in usage that caused the kernel to kill the process.
That your sacct call shows 1.6 GB of usage just before the 3 GB job was cancelled may hint at how your process is using memory.
A data structure used by your process may require resizing as it grows. In the process of reallocating that data, your process may temporarily ask for a chunk of memory larger than what Slurm has available for that job.
Depending on the implementation, a C++ std::vector, for instance, may allocate a new buffer that is twice the size (or some other multiple) of the old one once enough elements are added, and then copy the data over from the old buffer. Until that copy finishes, both buffers exist at the same time.
Speaking in general terms, without knowing any specifics about what you're running: in your example, temporarily creating a data structure twice 1.6 GB in size (3.2 GB) would by itself exceed the 3 GB limit, before counting the 1.6 GB already allocated. That would seem to be enough to trigger job cancellation.
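To make the reallocation argument concrete, here is a minimal sketch that grows a std::vector and records how many bytes the old and new buffers occupy together each time the capacity changes. The names GrowthStats and measure_growth are hypothetical helpers made up for this illustration, and the growth factor you observe depends on your standard library, so treat the numbers as indicative only. It also counts only the vector's own buffers, not anything else your process has allocated.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type for this illustration only.
struct GrowthStats {
    std::size_t reallocations = 0;         // times the capacity changed
    std::size_t final_capacity_bytes = 0;  // buffer size once growth is done
    std::size_t peak_transient_bytes = 0;  // largest old + new buffer pair
};

// Appends n doubles one at a time and tracks the worst-case moment during a
// reallocation, when the old and new buffers are both allocated.
GrowthStats measure_growth(std::size_t n) {
    std::vector<double> v;
    GrowthStats stats;
    std::size_t old_capacity = v.capacity();

    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<double>(i));
        if (v.capacity() != old_capacity) {
            // A capacity change means a new buffer was allocated and the
            // elements were moved over from the old one.
            const std::size_t transient =
                (old_capacity + v.capacity()) * sizeof(double);
            if (transient > stats.peak_transient_bytes) {
                stats.peak_transient_bytes = transient;
            }
            ++stats.reallocations;
            old_capacity = v.capacity();
        }
    }

    stats.final_capacity_bytes = v.capacity() * sizeof(double);
    return stats;
}
```

Calling measure_growth with a large n and comparing peak_transient_bytes against final_capacity_bytes shows how far the short-lived peak can sit above the steady-state footprint, which is the kind of spike a polling-based MaxRSS reading can miss. If the final size is known up front, calling reserve on the vector before filling it avoids these intermediate copies.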
Upvotes: 9