Lu4

Reputation: 15032

Running an application on a cluster

Abstract

My processing is done by two console applications (Stage-estimate and Stage-step). Each application processes files on disk, and the files are organized into folders. Each folder represents one step of the processing, which is considered complete when all of its files have been estimated.

As an example, let's say we are at Step 0 with the following directory structure:

Folder 0 contains:

000.data
001.data
002.data
...
999.data

We have the data files; now we need to estimate them. We run the Stage-estimate application 1000 times (once per file), which results in the following directory structure:

Folder 0 contains:

000.data
000.estimate
001.data
001.estimate
002.data
002.estimate
...
999.data
999.estimate

Step 0 is now complete: we have all the data/estimate pairs. In order to move to Step 1, we run the Stage-step application 1000 times, once on every data/estimate pair, which produces a new set of 1000 *.data files in folder 1. After the Stage-step application has completed, folder 1 has the same structure that folder 0 had at the start of Step 0:

Folder 1 contains:

000.data
001.data
002.data
...
999.data

From here the process repeats until it is canceled.
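
For concreteness, running the whole thing sequentially on a single machine amounts to a loop like the following. This is only a sketch: the command-line arguments of Stage-estimate and Stage-step are assumptions, shown just to make the folder/step flow explicit.

    # Sequential sketch of the Step 0 -> Step 1 -> ... flow described above.
    # The exact arguments of Stage-estimate / Stage-step are assumed here.
    import pathlib
    import subprocess

    step = 0
    while True:                                   # repeats until canceled
        folder = pathlib.Path(str(step))
        next_folder = pathlib.Path(str(step + 1))
        next_folder.mkdir(exist_ok=True)

        for data_file in sorted(folder.glob("*.data")):
            # Produces e.g. 000.estimate next to 000.data (the heavy part)
            subprocess.run(["Stage-estimate", str(data_file)], check=True)

        for data_file in sorted(folder.glob("*.data")):
            estimate_file = data_file.with_suffix(".estimate")
            # Produces the corresponding *.data file in the next step's folder (the cheap part)
            subprocess.run(["Stage-step", str(data_file), str(estimate_file), str(next_folder)],
                           check=True)

        step += 1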

The Problem

The Stage-estimate application does some pretty heavy calculations; it consumes 99% of the overall processing time compared to the Stage-step application.

I was planning to use AWS in order to speed things up. I don't want to start inventing special batch files that would call my applications in the way described above; I know there is software that does the heavy lifting of scheduling processes and other cluster-related work.

Question

I have never dealt with cluster computing. Off the top of my head I can see that the application parallelizes really nicely and fits the AWS infrastructure. On the other hand, I am a complete newbie in the world of cluster computing and I don't know where to start. I have worked with AWS, but never with anything related to cluster computing; I don't know how to organize the flow I've described or how to make it run efficiently, so I would appreciate it if you could point me in the right direction or provide some links to demos / best practices.

Thank you in advance!

Upvotes: 0

Views: 68

Answers (1)

Adam Ocsvari

Reputation: 8376

__________Edit__________

Based on your comment: you can put all your jobs from stage 0 into a queue and start processing them. You can also have logic that checks whether only a few jobs are left and then tries to add new jobs from stage 1. This would speed up your calculation a bit and give you better resource usage, but it's optional and makes your system more complex.
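
A minimal sketch of that producer side, assuming Python with boto3, an SQS queue that already exists, and a message format chosen here for illustration (the stage number plus the S3 key of the input file):

    # Producer sketch: enqueue all stage-0 jobs, then top the queue up with stage-1
    # jobs once only a few remain. Queue URL, threshold and message format are placeholders.
    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/estimate-jobs"  # placeholder

    def enqueue_stage(stage, keys):
        for key in keys:
            sqs.send_message(QueueUrl=QUEUE_URL,
                             MessageBody=json.dumps({"stage": stage, "key": key}))

    def jobs_remaining():
        attrs = sqs.get_queue_attributes(
            QueueUrl=QUEUE_URL,
            AttributeNames=["ApproximateNumberOfMessages"])
        return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    enqueue_stage(0, ["0/%03d.data" % i for i in range(1000)])
    # Later, from a scheduler or a loop: refill when stage 0 is nearly drained.
    # In practice you would only enqueue the stage-1 inputs that already exist.
    if jobs_remaining() < 50:          # threshold is arbitrary
        enqueue_stage(1, ["1/%03d.data" % i for i in range(1000)])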

I suggest you use SQS (or SWF) for storing the jobs, S3 for storing the files, and an autoscaling group of spot instances for the worker nodes.

Unfortunately, Lambda doesn't support C++ at the moment (Node.js and Java are supported).

________Original________

AWS supports several concepts which you may consider:

Decoupling: You can use SQS (Simple Queue Service) for job queuing, which gives you a redundant and fault-tolerant job queue. You can have a fleet of worker instances that request jobs from the queue, run them, and, once they are finished, delete the job from the queue. If an instance hangs or crashes during the execution of a job, the job goes back to the queue after the visibility timeout and another instance will execute it again.
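
A sketch of such a worker loop with boto3; the queue URL, the message format, and the way Stage-estimate is invoked are the same assumptions as in the sketches above:

    # Worker sketch: long-poll the queue, run the job, delete the message on success.
    # If the process crashes before delete_message, the visibility timeout expires
    # and another worker picks the job up again.
    import json
    import subprocess
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/estimate-jobs"  # placeholder

    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)        # long polling
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            # Fetch the input from S3, run the heavy executable, upload the result
            # (the S3 part is sketched under "Redundant storage" below).
            subprocess.run(["Stage-estimate", job["key"]], check=True)  # assumed CLI
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])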

Another service is SWF (Simple Workflow Service). It internally uses SQS queues; with this service you may need less scripting to glue your entire workflow together.

Redundant storage: I would definitely use AWS S3 for storage, because it's cheap and redundant. After a first read, I don't think you need any advanced (file-system-like) features, for example locking.
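
Mapping the folder layout from the question onto S3 keys is straightforward. A sketch, assuming one bucket and keys of the form <step>/<index>.data:

    # S3 sketch: the step folders become key prefixes ("0/000.data", "0/000.estimate", ...).
    # The bucket name is a placeholder.
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-estimation-bucket"  # placeholder

    s3.download_file(BUCKET, "0/000.data", "/tmp/000.data")       # worker pulls its input
    # ... run Stage-estimate on /tmp/000.data ...
    s3.upload_file("/tmp/000.estimate", BUCKET, "0/000.estimate") # and pushes the result back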

Spot instances: For the worker nodes I would use spot instances, which are much cheaper. The only issue with them is if you need a really fast answer for your task all the time. (If you are generating daily reports, spot instances are a perfect solution.)
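
A sketch of the autoscaling group of spot workers mentioned in the edit above, using boto3; the AMI, instance type, bid price, and group sizes are all placeholders:

    # Worker fleet sketch: a launch configuration with a spot bid, wrapped in an
    # Auto Scaling group. All identifiers and numbers below are placeholders.
    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.create_launch_configuration(
        LaunchConfigurationName="estimate-worker-lc",
        ImageId="ami-12345678",            # AMI with Stage-estimate/Stage-step baked in
        InstanceType="c4.xlarge",
        SpotPrice="0.10")                  # maximum bid in USD per hour

    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="estimate-workers",
        LaunchConfigurationName="estimate-worker-lc",
        MinSize=0,
        MaxSize=20,
        DesiredCapacity=10,
        AvailabilityZones=["us-east-1a"])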

+1: You could use AWS Lambda functions to run your jobs. You can trigger your Lambda function based on S3 events, for example when a new *.data file is uploaded. However, Lambda functions cannot run for too long. But if you are able to use Lambda functions, then your entire environment will contain only S3 buckets and Lambda functions. Both are AWS-managed services, so your system would be extremely flexible and fault tolerant. I can't give any exact details about pricing, but I assume it would be cheaper than running EC2 instances.
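
The shape of an S3-triggered handler looks roughly like this; it is written in Python only to match the other sketches here (as noted above, at the time of this answer Lambda itself offered Node.js and Java), and the estimation step is left as a comment:

    # S3 put events arrive as a list of records; each record names the bucket and key.
    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]        # e.g. "0/000.data"
            # run the estimation for this object and write the *.estimate result back to S3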

Summary: If you can run your estimations in parallel, AWS gives you a lot of power and speed (for good money), especially if your load changes during the day.

A good source: White Paper on ‘Cloud Architectures’ and Best Practices of Amazon S3, EC2, SimpleDB, SQS

Upvotes: 1
