Reputation: 6896
I am working to put a distributed computing framework in place and am investigating 0mq as the underlying communication layer. Jobs, which can take up to two hours to run, can be started on any worker in a cluster (the assumption is that all workers in a cluster have access to the same resources and are capable of running the same jobs). The manager is responsible for monitoring the system state and triggering the jobs. The worker machines are also multi-core, but I am ignoring that aspect for now.
The problem I am trying to solve is distributing job messages to workers:
Ignoring the multi-core aspect and assuming a cluster of 5 workers: if the manager puts 7 jobs into the queue, five of them should be running, one on each worker, with two remaining in the queue. When a job ends, that worker should receive the next job in the queue.
I have run experiments and researched:
The questions are:
Upvotes: 0
Views: 612
Reputation: 1216
You need to devise a flow control protocol, in addition to sending work messages to the workers.
A protocol I like has the workers send Ready messages to the manager. The manager puts the ready workers into a list and replies to their Ready messages one at a time until there is no more work. This is load balancing, a commonly used pattern; there is a Load Balancing example in chapter 3 of the zmq guide. I think ROUTER/DEALER sockets work best.
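As a rough sketch of the worker side of that protocol (the endpoint, the b"READY" frame, and the run_job helper are illustrative assumptions, not anything from the question), a DEALER worker could look like this:

```python
# Worker-side sketch of the Ready protocol using pyzmq.
# The address, the b"READY" frame, and run_job() are illustrative assumptions.
import zmq

def run_job(job):
    pass  # placeholder for the actual job, which may run for up to two hours

context = zmq.Context()
worker = context.socket(zmq.DEALER)
worker.connect("tcp://manager-host:5555")  # assumed manager endpoint

while True:
    worker.send(b"READY")   # announce "I am idle" (also signals the last job finished)
    job = worker.recv()     # block until the manager assigns the next job
    run_job(job)
```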
The beauty of load balancing is its sheer simplicity. The manager just has to maintain a list of jobs and another list of ready workers, and reply to their Ready messages with the work. It doesn't even matter what kind of list it is: FIFO, LIFO, or random.
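A minimal sketch of that manager-side bookkeeping, again assuming pyzmq and made-up job payloads and endpoint:

```python
# Manager-side sketch: a ROUTER socket plus two lists, one of queued jobs and
# one of idle worker identities. The endpoint and job payloads are assumptions.
import zmq

context = zmq.Context()
manager = context.socket(zmq.ROUTER)
manager.bind("tcp://*:5555")

jobs = [b"job-%d" % i for i in range(7)]   # e.g. 7 jobs for a 5-worker cluster
ready_workers = []                          # identities of workers awaiting work

while True:
    # Each incoming message is a READY from some worker; remember its identity.
    identity, _request = manager.recv_multipart()
    ready_workers.append(identity)

    # Hand out work while there is both a queued job and an idle worker;
    # leftover jobs simply wait in the list until another READY arrives.
    while jobs and ready_workers:
        manager.send_multipart([ready_workers.pop(0), jobs.pop(0)])
```

With five workers connected, the first five Ready messages drain five jobs immediately; the remaining two are dispatched as workers finish and send Ready again, which matches the behaviour described in the question.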
The fact that the workers are running on multicore systems shouldn't require anything special. Workers should simply try to avoid starving each other of resources, whether that's disk, memory, or CPU. Of course, that's another layer, independent of the load balancing pattern.
Upvotes: 1