stanton63

Reputation: 13

Post processing queue for Slurm

My question is not directly about SLURM, but rather about how to organize one's work around it. The jobs I submit to SLURM fall into two categories: jobs which are part of a process and whose results will be used by subsequent steps, and jobs whose results have to be manually reviewed or processed. Results which can be processed automatically are not really a problem, since any error checking can be automated as well. For results which need to be manually verified, however, I always fear forgetting about some of them. This is a problem especially when I submit a large batch of long-running jobs whose completion may span several days. Reviewing all the results is fairly easy if it is done AFTER all the jobs have completed and none have failed; however, I often have to review results gradually as the jobs end, or resubmit some jobs which failed. This introduces a high risk of manual errors, as I could easily forget to process some results or to resubmit some failed jobs.

I'd like to have a simple way to remind me of all the jobs which have completed and need to be reviewed, as well as a list of all jobs which failed and should be resubmitted. I have thought about a few ways, some of which I have tested and am using. None of them is perfect, and I'm considering writing custom software to manage these requirements. I am asking this question to see if an alternative solution already exists.

I have identified three solutions:

  1. When submitting a job add the job id to a document of jobs which need to be processed

    • All jobs are available in a list
    • The list creation process can be automated
    • The job directory can be specified and easily retrieved
    • You have to manually check all jobs in the list every time, which makes it cumbersome
  2. Have the jobs submit themselves to a document on completion or failure

    • Only jobs which need to be processed are present in the list
    • The list is automatically created
    • The job directory can be specified in a custom way (useful for job arrays)
    • Each job submission script has to integrate some custom logic for this
  3. Use the sacct command with the workdir output parameter

    • No creation of custom files
    • The list contains all recent jobs
    • The workdir may be inaccurate in case you change directory in the job
    • You cannot filter results which you already processed from the output
    • It may be problematic if you have a backlog and, say, you have to process a job which completed weeks ago

All these solutions are fairly simple to implement, but each has got its own drawbacks. Do you know of any other solution which could be used in these cases and would lead to better results?

Upvotes: 0

Views: 71

Answers (1)

damienfrancois

Reputation: 59260

There are various tools that can help in managing jobs; you can review some of them in this document or these videos, though I am not sure they are designed to cope with manual intervention.

The solution I would consider involves using the comment field of jobs. You can submit your jobs with an initial comment using --comment="Initial comment" and then include a command as the last line of the submission script:

scontrol update jobid=$SLURM_JOB_ID comment="Ready for review"
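If you also want failed runs to be labelled differently, a possible variation is to set the comment from the exit status of your main command. This is only a sketch: the helper names review_label and tag_current_job and the "Failed-resubmit" label are made up for illustration, and the scontrol call is skipped when the script is not running inside a Slurm job.

```shell
#!/bin/bash
# Hypothetical helpers; the names and label texts are assumptions, not Slurm features.

# Map the exit status of the payload to a review label.
review_label() {
    if [ "$1" -eq 0 ]; then
        echo "Ready for review"
    else
        echo "Failed-resubmit (exit $1)"
    fi
}

# Write the label into the comment field of the current job.
# Does nothing when run outside a Slurm job or when scontrol is unavailable.
tag_current_job() {
    local label
    label=$(review_label "$1")
    if [ -n "${SLURM_JOB_ID:-}" ] && command -v scontrol >/dev/null 2>&1; then
        scontrol update jobid="$SLURM_JOB_ID" comment="$label"
    fi
}

# Intended use at the end of a submission script (my_program is a placeholder):
#   my_program
#   tag_current_job $?
```

Note that a job killed by Slurm itself, for instance on a time limit, would presumably never reach that last line and would keep its initial comment, so the state column is still worth checking alongside the comment.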

You can then list the comments of all running and pending jobs with

squeue --me --Format jobid,state,comment

and, provided Slurm is configured with AccountingStoreFlags=job_comment, when the job is finished, with

sacct -X --user $USER --format jobid,state,comment

When you have reviewed the job output, you can alter its comment in the Slurm database with

sacctmgr modify job jobid=XXXXXX set comment="Reviewed-OK"

or

sacctmgr modify job jobid=XXXXXX set comment="Reviewed-tofix"

and then, maybe

sacctmgr modify job jobid=XXXXXX set comment="Reviewed-fixed"

This way, you can label the jobs based on the manual actions you need to take on them.
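To turn those labels into a to-do list, you could filter the sacct output. The sketch below assumes the comment values used above, assumes sacct is called with --parsable2 --noheader so that each line looks like jobid|state|comment, and uses a made-up function name needs_attention. Which states count as "to resubmit" is my own choice; adapt it to your case.

```shell
#!/bin/bash
# Hypothetical filter: reads "jobid|state|comment" lines on stdin and prints
# the jobs that still need a manual action.
needs_attention() {
    awk -F'|' '
        # Job flagged by its own submission script and not yet reviewed.
        $3 == "Ready for review" { print "REVIEW " $1; next }
        # Job ended abnormally and has not been labelled as reviewed yet.
        $2 ~ /^(FAILED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY)/ && $3 !~ /^Reviewed-/ {
            print "RESUBMIT " $1
        }
    '
}

# Intended use on the cluster:
#   sacct -X --user "$USER" --format jobid,state,comment --parsable2 --noheader | needs_attention
```

Once a job is relabelled with one of the Reviewed-... comments it drops out of this list, which addresses the "cannot filter results already processed" concern from the question. A comment containing a | character would confuse the field splitting, so keep the labels simple.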

Upvotes: 1
