Reputation: 13
My question is not directly about SLURM, but rather about how to organize one's work around it. The jobs I submit to SLURM fall into two categories: jobs that are part of a pipeline and whose results are consumed by subsequent steps, and jobs whose results have to be manually reviewed or processed. Results that can be processed automatically are not really a problem, since any error checking can be automated as well. For results that need to be manually verified, however, I am always afraid of forgetting some of them. This is a problem especially when I submit a large batch of long-running jobs whose completion may span several days. Reviewing all the results is fairly easy if it is done AFTER all the jobs have completed and none have failed, but I often have to review results gradually as the jobs end, or resubmit jobs that failed. This carries a high risk of manual error, as I could easily forget to process some results or to resubmit some failed jobs.
I'd like a simple way to be reminded of all the jobs that have completed and need to be reviewed, as well as a list of all the jobs that failed and should be resubmitted. I have thought of a few ways, some of which I have tested and am using. None of them is perfect, and I'm considering writing custom software to manage these requirements. I am asking this question to see whether an alternative solution already exists.
I have identified three solutions:
1. When submitting a job, add the job id to a document of jobs which need to be processed
2. Have the jobs add themselves to a document on completion or failure
3. Use the sacct command with the workdir output parameter
All these solutions are fairly simple to implement, but each has its own drawbacks. Do you know of any other solution that could be used in these cases and would lead to better results?
Upvotes: 0
Views: 71
Reputation: 59260
There are various tools that can help in managing jobs; you can review some of them in this document or these videos, though I am not sure they are designed to cope with manual intervention.
The solution I would consider involves using the comment field of jobs. You can submit your jobs with an initial comment using --comment="Initial comment" and then include a command as the last line of the submission script:
scontrol update jobid=$SLURM_JOB_ID comment="Ready for review"
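As a rough sketch of how this could look inside a submission script, the job below labels itself differently depending on whether its payload succeeded. The function names label_job and run_payload and the "Failed-resubmit" label are hypothetical, chosen only for illustration. Note that a job killed by Slurm (for example on timeout) would probably never reach the labelling step and would keep its initial comment.

```shell
#!/bin/bash
#SBATCH --comment="Initial comment"

# Set the comment of the current job; do nothing when not running under Slurm.
label_job() {
    [ -n "${SLURM_JOB_ID:-}" ] || return 0
    scontrol update jobid="$SLURM_JOB_ID" comment="$1"
}

# Hypothetical payload: replace the body with the real commands of the job.
run_payload() {
    true
}

main() {
    if run_payload; then
        label_job "Ready for review"
    else
        status=$?                      # exit status of the failed payload
        label_job "Failed-resubmit"    # hypothetical label for failed jobs
        return "$status"               # keep the failure visible to Slurm
    fi
}

main
```

Returning the payload's exit status from the last command keeps the job state reported by Slurm consistent with the comment.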
You can then list the comments of all running and pending jobs with
squeue --me --Format jobid,state,comment
and, provided Slurm is configured with AccountingStoreFlags=job_comment, when the job is finished, with
sacct -X --user $USER --format jobid,state,comment
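To turn that listing into a to-do list, the output can be filtered by comment or by state. The following is a minimal sketch, assuming sacct is run with --noheader and --parsable2 so that each line has the form jobid|state|comment; the function names jobs_with_comment and failed_jobs are hypothetical helpers, and the set of states treated as failures is only a suggestion.

```shell
# Print the IDs of jobs whose comment equals the given label.
# Reads "jobid|state|comment" lines on standard input.
jobs_with_comment() {
    awk -F'|' -v label="$1" '$3 == label { print $1 }'
}

# Print the IDs of jobs whose state suggests they should be resubmitted.
# Reads "jobid|state|comment" lines on standard input.
failed_jobs() {
    awk -F'|' '$2 ~ /^(FAILED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY)/ { print $1 }'
}

# Example usage on a cluster (not executed here):
#   sacct -X --user "$USER" --noheader --parsable2 \
#         --format jobid,state,comment | jobs_with_comment "Ready for review"
#   sacct -X --user "$USER" --noheader --parsable2 \
#         --format jobid,state,comment | failed_jobs
```

Adding a time window with the --starttime option of sacct would likely be needed in practice, since by default sacct only reports recent jobs.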
When you have reviewed the job output, you can alter its comment in the Slurm database with
sacctmgr modify job jobid=XXXXXX set comment="Reviewed-OK"
or
sacctmgr modify job jobid=XXXXXX set comment="Reviewed-tofix"
and then, maybe
sacctmgr modify job jobid=XXXXXX set comment="Reviewed-fixed"
This way, you can label the jobs based on the manual actions you need to take on them.
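Since the labels are free text, a typo in one of them would silently hide a job from later searches. A small wrapper that only accepts a fixed set of labels may help; the sketch below uses a hypothetical function name, review_job, and relies on the -i (--immediate) option of sacctmgr to skip the confirmation prompt.

```shell
# Set the review label of a finished job, rejecting unknown labels.
# Usage: review_job JOBID LABEL
review_job() {
    case "$2" in
        Reviewed-OK|Reviewed-tofix|Reviewed-fixed) ;;
        *)
            echo "unknown label: $2" >&2
            return 1
            ;;
    esac
    sacctmgr -i modify job jobid="$1" set comment="$2"
}

# Example usage on a cluster (not executed here):
#   review_job 123456 Reviewed-tofix
```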
Upvotes: 1