Reputation: 3614
I am planning to run Hadoop on EC2. Since we pay per instance-hour, it is wasteful to keep more instances running than a job actually requires.
In our application, many jobs execute concurrently, and the number of slaves needed varies over time. Is it possible to start the Hadoop cluster with a minimum number of slaves and then manage availability based on demand?
i.e. create/destroy slaves on demand
Sub-question: can a Hadoop cluster manage multiple jobs concurrently?
Thanks
Upvotes: 1
Views: 258
Reputation: 8685
Just want to let you know that we are doing some work on this in Apache Whirr. We are tracking progress in WHIRR-214. Vote or join development. :)
Upvotes: 0
Reputation: 3614
This seems promising http://hadoop.apache.org/common/docs/r0.17.1/hod.html
Upvotes: 0
Reputation: 11
The default scheduler in Hadoop is a simple FIFO one. You can look into using the FairScheduler, which assigns each running job a share of the cluster and has extensive configuration options to control those shares.
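As a rough sketch, enabling the FairScheduler is a matter of pointing the JobTracker at it in mapred-site.xml (property names here are from the Hadoop 0.20/1.x era, and the allocation file path is just an example; check the docs for your version):

```xml
<!-- mapred-site.xml: replace the default FIFO scheduler -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <!-- optional: per-pool shares defined in a separate allocation file -->
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>
```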
As far as EC2 is concerned, you can easily start off with some number of nodes and then, once you see that there are too many tasks in the queue and all the slots in the cluster are occupied, add more. You simply start up an instance and launch a TaskTracker on it, which will register with the JobTracker.
However, you will need your own system to manage the startup and shutdown of these nodes.
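The scale-out decision in such a system can be as simple as comparing queued tasks against available slots. A minimal sketch of that sizing logic (all names here — `nodes_to_add`, `slots_per_node`, etc. — are hypothetical, not part of any Hadoop API; the actual launch/registration would be done via the EC2 API and the TaskTracker startup script):

```python
import math

def nodes_to_add(pending_tasks, total_slots, slots_per_node, max_nodes=20):
    """Return how many worker nodes to launch so that the pending
    tasks would fit into the cluster's task slots, capped at max_nodes."""
    if slots_per_node <= 0:
        raise ValueError("slots_per_node must be positive")
    shortfall = pending_tasks - total_slots
    if shortfall <= 0:
        return 0  # enough free slots already; launch nothing
    return min(max_nodes, math.ceil(shortfall / slots_per_node))
```

A monitoring loop would poll the JobTracker for queue depth, call something like this, and launch or terminate EC2 instances accordingly; the shutdown side needs care not to kill nodes that are still running tasks.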
Upvotes: 1