Dynamically reordering jobs in a multiprocessing pool in Python

Question

I'm writing a Python script (for Cygwin and Linux environments) to run regression testing on a program that is run from the command line using subprocess.Popen(). Basically, I have a set of jobs, a subset of which (on the order of 10 to 1000) need to be run depending on the needs of the developer. Each job can take anywhere from a few seconds to 20 minutes to complete.

I have my jobs running successfully across multiple processors, but I'm trying to eke out some time savings by intelligently ordering the jobs (based on past performance) to run the longer jobs first. The complication is that some jobs (steady state calculations) need to be run before others (the transients based on the initial conditions determined by the steady state).

My current method of handling this is to run the parent job and all child jobs recursively on the same process, but some jobs have multiple, long-running children. Once the parent job is complete, I'd like to add the children back to the pool to farm out to other processes, but they would need to be added to the head of the queue. I'm not sure I can do this with multiprocessing.Pool. I looked for examples with Manager, but they all are based on networking it seems, and not particularly applicable. Any help in the form of code or links to a good tutorial on multiprocessing (I've googled...) would be much appreciated. Here's a skeleton of the code for what I've got so far, commented to point out the child jobs that I would like spawned off on other processors.

import multiprocessing
import subprocess

class Job(object):
  def __init__(self, popenArgs, runTime, children):
    self.popenArgs = popenArgs #list to be fed to popen
    self.runTime = runTime #Approximate runTime for the job
    self.children = children #Jobs that require this job to run first

def runJob(job):
  subprocess.Popen(job.popenArgs).wait()
  ####################################################
  #I want to remove this, and instead kick these back to the pool
  for j in job.children: 
    runJob(j)
  ####################################################

def main(jobs):
  # This jobs argument contains only jobs which are ready to be run,
  # i.e. it holds no child jobs, only parent-less jobs
  jobs.sort(key=lambda job: job.runTime, reverse=True)
  multiprocessing.Pool(4).map(runJob, jobs)

Alp · Accepted Answer

First, let me second Armin Rigo's comment: There's no reason to use multiple processes here instead of multiple threads. In the controlling process you're spending most of your time waiting on subprocesses to finish; you don't have CPU-intensive work to parallelize.

Using threads will also make it easier to solve your main problem. Right now you're storing the jobs in attributes of other jobs, an implicit dependency graph. You need a separate data structure that orders the jobs in terms of scheduling. Also, each tree of jobs is currently tied to one worker process. You want to decouple your workers from the data structure you use to hold the jobs. Then the workers each draw jobs from the same queue of tasks; after a worker finishes its job, it enqueues the job's children, which can then be handled by any available worker.

Since you want the child jobs to be inserted at the front of the line when their parent is finished, a stack-like container would seem to fit your needs; the Queue module (renamed queue in Python 3) provides a thread-safe LifoQueue class that you can use.

import threading
import subprocess
try:
  from queue import LifoQueue  # Python 3
except ImportError:
  from Queue import LifoQueue  # Python 2

class Job(object):
  def __init__(self, popenArgs, runTime, children):
    self.popenArgs = popenArgs
    self.runTime = runTime
    self.children = children

def run_jobs(queue):
  while True:
    job = queue.get()
    subprocess.Popen(job.popenArgs).wait()
    for child in job.children: 
      queue.put(child)
    queue.task_done()

# Parameter 'jobs' contains the jobs that have no parent.
def main(jobs):
  job_queue = LifoQueue()
  num_workers = 4
  # Sort ascending: the queue is LIFO, so the longest job, pushed last, is popped first.
  jobs.sort(key=lambda job: job.runTime)
  for job in jobs:
    job_queue.put(job)
  for i in range(num_workers):
    t = threading.Thread(target=run_jobs, args=(job_queue,))
    t.daemon = True
    t.start()
  job_queue.join()
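
The ascending sort in main() can look backwards at first glance. Here is a minimal sketch of the ordering it relies on, assuming Python 3 (where the module is named queue) and using made-up (name, runTime) tuples in place of Job objects:

```python
from queue import LifoQueue  # Python 3 name; the Python 2 module is "Queue"

# Hypothetical (name, runTime) pairs standing in for Job objects.
jobs = [("medium", 60), ("long", 1200), ("short", 5)]

q = LifoQueue()
# Ascending sort: the longest job is pushed last, so it sits on top of the stack.
for name, run_time in sorted(jobs, key=lambda j: j[1]):
    q.put(name)

# A child enqueued later lands on top of everything already waiting.
q.put("child-of-long")

# Drain the queue to see the order in which workers would receive the jobs.
order = [q.get() for _ in range(q.qsize())]
print(order)
```

The child comes out first, followed by the remaining jobs from longest to shortest, which is the "head of the queue" behaviour the question asks for.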

A couple of notes: (1) We can't know when all the work is done by monitoring the worker threads, since they don't keep track of the work to be done. That's the queue's job. So the main thread monitors the queue object to know when all the work is complete (job_queue.join()). We can thus mark the worker threads as daemon threads, so the process will exit whenever the main thread does without waiting on the workers. We thereby avoid the need for communication between the main thread and the worker threads in order to tell the latter when to break out of their loops and stop.

(2) We know all the work is done when all tasks that have been enqueued have been marked as done (specifically, when task_done() has been called a number of times equal to the number of items that have been enqueued). It wouldn't be reliable to use the queue's being empty as the condition that all work is done; the queue might be momentarily and misleadingly empty between popping a job from it and enqueuing that job's children.
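
To try both notes without launching real programs, here is a rough sketch that swaps the subprocess call for time.sleep(). It assumes Python 3; FakeJob, run_all, and the job names are hypothetical stand-ins, not part of the code above.

```python
import threading
import time
from queue import LifoQueue  # Python 3 name; the Python 2 module is "Queue"

class FakeJob(object):
    """Hypothetical stand-in for Job: sleeps instead of launching a subprocess."""
    def __init__(self, name, seconds, children=()):
        self.name = name
        self.seconds = seconds
        self.children = list(children)

finished = []                      # job names in completion order
finished_lock = threading.Lock()

def run_jobs(queue):
    while True:
        job = queue.get()
        time.sleep(job.seconds)    # placeholder for subprocess.Popen(...).wait()
        with finished_lock:
            finished.append(job.name)
        # Enqueue children *before* task_done(), so the unfinished-task count
        # never drops to zero while work is still pending.
        for child in job.children:
            queue.put(child)
        queue.task_done()

def run_all(roots, num_workers=4):
    q = LifoQueue()
    for job in sorted(roots, key=lambda j: j.seconds):
        q.put(job)
    for _ in range(num_workers):
        t = threading.Thread(target=run_jobs, args=(q,))
        t.daemon = True            # idle workers will not keep the process alive
        t.start()
    q.join()                       # returns once every put() has a matching task_done()

steady = FakeJob("steady", 0.05,
                 [FakeJob("transient-a", 0.02), FakeJob("transient-b", 0.01)])
other = FakeJob("other", 0.01)
run_all([steady, other])
print(finished)
```

When run_all() returns, all four jobs have been recorded, and "steady" always appears before its two transients, because they are only enqueued after it completes. The relative order of the other entries depends on thread timing.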
