Fredrik Karlsson

Reputation: 33

Amazon batch job process queue

I am using AWS EC2 instances for bioinformatics work. I have a number (~1000) of large files that should be processed with a script on EC2 instances, and the results should be uploaded back to an S3 bucket. I want to distribute the jobs (files) to a number of EC2 instances, preferably launched at spot prices.

What I need is a simple-to-use queuing system (possibly AWS SQS or something else) that can distribute jobs to instances and restart jobs if an instance fails (because the spot price rose too high, or for other reasons). I have studied the AWS SQS examples, but they are too advanced and typically involve auto-scaling and complicated message-generating applications.

Can someone point out, conceptually, the best and easiest way to solve this? Are there any examples of this simple application of AWS SQS? How should a bunch of instances be started, and how do I tell them to listen to the queue?

My workflow is basically like this for each input file:

aws s3 cp s3://mybucket/file localFile ## Possibly streaming the file without copy
work.py --input localFile --output outputFile
aws s3 cp outputFile s3://mybucket/output/outputFile

Upvotes: 2

Views: 5240

Answers (2)

Jonatan

Reputation: 730

As of ~December 2016, AWS has launched a service called AWS Batch, which may be a good (perhaps even great) fit for the workload described in the question. Please review Batch before selecting one of the other suggestions.
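For illustration only (this sketch is not from the original answer): with AWS Batch, each input file becomes a submitted job. Assuming a job queue and a job definition wrapping work.py have already been created, and using placeholder names throughout, a submission with boto3 might look roughly like this:

#!/usr/bin/env python
# Hypothetical sketch: submit one input file as an AWS Batch job.
# 'my-job-queue' and 'work-py' are placeholder names; the job definition
# is assumed to wrap the work.py container/script described in the question.
import boto3

batch = boto3.client('batch', region_name='ap-southeast-2')

response = batch.submit_job(
    jobName='process-file-0001',
    jobQueue='my-job-queue',
    jobDefinition='work-py',
    containerOverrides={
        'command': ['work.py',
                    '--input', 's3://mybucket/file',
                    '--output', 's3://mybucket/output/outputFile'],
    },
)
print(response['jobId'])

Batch then takes care of queueing, placing the jobs on (optionally Spot) compute resources, and retrying failed jobs, which covers most of what the question asks for.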

Upvotes: 5

John Rotenstein

Reputation: 270124

You are describing a very common batch-oriented design pattern:

  • Work is placed in a queue
  • One or more "worker" instances pull work from the queue
  • The number of worker instances scales based upon the size of the queue and urgency of the work
  • Using spot pricing to minimise costs

The best way to accomplish this is:

  • Using Amazon Simple Queue Service (SQS) to store the work requests
  • Launching Amazon EC2 instances, each of which repeatedly:
    • Pulls a message from the queue
    • Processes the message (eg via the download/process/upload steps you listed above)
    • Removes the message from the queue (to signify that the work was completed)
  • Using Auto-Scaling to control the number of instances, so that more instances can be launched when there is a large backlog, and all instances can be turned off when there is no work
  • Using Spot pricing with the Auto-Scaling Group, so that instances will automatically "come back to life" once the Spot price drops below your maximum bid price

Rather than having the queueing system "distribute jobs to instances and restart jobs if an instance fails", SQS would merely be used to store the jobs. Auto-Scaling would be responsible for launching instances (including restarts when the Spot price changes) and the instances themselves would pull work from the queue. Think of it as a "pull" model rather than a "push" model.

While the total system might seem complex, each individual component is quite simple. I would recommend doing it one step at a time:

  1. Have a system that somehow pushes the work requests into an SQS queue. This could be as simple as using aws sqs send-message from the CLI, or adding a few lines of code in Python using Boto (the AWS SDK for Python).

Here's some sample code (call it with a message on the command-line):

#!/usr/bin/python2.7

import boto, boto.sqs
from boto.sqs.message import Message
from optparse import OptionParser

# Parse the command line; the first positional argument is the message body
parser = OptionParser()
(options, args) = parser.parse_args()

# Connect to SQS and open the queue
q_conn = boto.sqs.connect_to_region('ap-southeast-2')
q = q_conn.get_queue('my-queue')

# Wrap the message body and send it to the queue
m = Message()
m.set_body(args[0])
print q.write(m)

print args[0] + ' pushed to Queue'
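To queue up all ~1000 input files in one go, a sketch along the same lines could list the S3 bucket and push one message per object key. The bucket name, queue name and region below are placeholders, not values from the question or answer:

#!/usr/bin/python2.7
# Sketch: push one SQS message per object in the input bucket.
# 'mybucket', 'my-queue' and the region are placeholders.
import boto, boto.sqs, boto.s3
from boto.sqs.message import Message

# Connect to S3 and SQS
s3_conn = boto.s3.connect_to_region('ap-southeast-2')
bucket = s3_conn.get_bucket('mybucket')
q_conn = boto.sqs.connect_to_region('ap-southeast-2')
q = q_conn.get_queue('my-queue')

# One message per input file; the message body is simply the S3 key
for key in bucket.list():
    m = Message()
    m.set_body(key.name)
    q.write(m)
    print key.name + ' pushed to Queue'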
  2. Configure an Amazon EC2 instance that can automatically start your Python app or script that pulls from SQS and processes your work. Use the User Data field to trigger work when the instance starts. Either run your workflow from a shell script, or write the S3 download/upload steps as part of your Python app too, including the loop that keeps pulling new messages (a sketch of such a loop follows the retrieval example below).

Here's some code to retrieve a message from a queue:

#!/usr/bin/python2.7

import boto, boto.sqs
from boto.sqs.message import Message

# Connect to Queue
q_conn = boto.sqs.connect_to_region('ap-southeast-2')
q = q_conn.get_queue('my-queue')

# Get a message; it stays hidden from other workers for 15 seconds
m = q.read(visibility_timeout=15)
if m is None:
  print "No message!"
else:
  print m.get_body()
  q.delete_message(m)
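As a minimal sketch of the full worker loop mentioned in step 2 (bucket name, queue name, temporary file paths and the visibility timeout are my own placeholders, and work.py is assumed to be installed and executable on the instance), the retrieval code above could be extended roughly like this:

#!/usr/bin/python2.7
# Sketch of a worker loop: pull a key from SQS, download the file from S3,
# run work.py on it, upload the result, then delete the message.
# All names and paths are placeholders.
import subprocess
import boto, boto.sqs, boto.s3
from boto.sqs.message import Message

q_conn = boto.sqs.connect_to_region('ap-southeast-2')
q = q_conn.get_queue('my-queue')
s3_conn = boto.s3.connect_to_region('ap-southeast-2')
bucket = s3_conn.get_bucket('mybucket')

while True:
    # Hide the message from other workers while this instance processes it
    m = q.read(visibility_timeout=3600)
    if m is None:
        break                          # queue is empty: stop (or sleep and retry)
    key_name = m.get_body()

    # Download, process, upload - mirrors the aws s3 cp / work.py workflow
    bucket.get_key(key_name).get_contents_to_filename('/tmp/input')
    subprocess.check_call(['work.py', '--input', '/tmp/input',
                           '--output', '/tmp/output'])
    out_key = bucket.new_key('output/' + key_name)
    out_key.set_contents_from_filename('/tmp/output')

    # Delete the message only once the work is done, so that a failed or
    # terminated instance simply lets the message reappear for another worker
    q.delete_message(m)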
  3. Configure an Auto-Scaling Launch Configuration that matches the configuration you just created for EC2. This tells Auto-Scaling how to start an instance (eg instance type, User Data) and what Spot price you're willing to pay.
  4. Create an Auto Scaling group to automatically launch instances.
  5. Configure Scaling policies if you'd like the Auto Scaling group to add/remove instances based upon the size of the queue (a sketch of these steps follows this list).
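Purely as an illustration of steps 3-5 (the AMI ID, instance type, Spot price, group names and the worker script path in User Data are placeholders, not values from the answer), the same pieces could be created with Boto like this:

#!/usr/bin/python2.7
# Illustrative sketch only: all IDs, names and prices below are placeholders.
import boto.ec2.autoscale
from boto.ec2.autoscale import LaunchConfiguration, AutoScalingGroup, ScalingPolicy

conn = boto.ec2.autoscale.connect_to_region('ap-southeast-2')

# Step 3: a Launch Configuration with a Spot price and User Data that
# starts the worker script on boot
lc = LaunchConfiguration(name='worker-lc',
                         image_id='ami-xxxxxxxx',
                         instance_type='c4.large',
                         spot_price='0.10',
                         user_data='#!/bin/bash\n/home/ec2-user/worker.py\n')
conn.create_launch_configuration(lc)

# Step 4: an Auto Scaling group that launches instances from that configuration
ag = AutoScalingGroup(group_name='worker-asg',
                      availability_zones=['ap-southeast-2a'],
                      launch_config=lc,
                      min_size=0, max_size=10,
                      connection=conn)
conn.create_auto_scaling_group(ag)

# Step 5: a simple scale-out policy; a CloudWatch alarm on the queue's
# ApproximateNumberOfMessagesVisible metric would invoke it
policy = ScalingPolicy(name='scale-out',
                       as_name='worker-asg',
                       adjustment_type='ChangeInCapacity',
                       scaling_adjustment=1,
                       cooldown=300)
conn.create_scaling_policy(policy)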


Upvotes: 11
