fozziethebeat

Reputation: 1180

Limiting the number of mappers running on Hadoop Streaming

Is it possible to limit the number of mappers running for a job at any given time with Hadoop Streaming? For example, I have a 28-node cluster that can run 1 task per node. If I have a job with 100 tasks, I'd like to use only, say, 20 of the 28 nodes at any point in time. I'd like to limit some jobs because they may contain many long-running tasks, and I sometimes want to run some faster jobs and be sure that they can start immediately rather than wait for the long-running job to finish.

I saw this question and the title is spot on but the answers don't seem to address this particular issue.

Thanks!

Upvotes: 0

Views: 773

Answers (2)

WestCoastProjects

Reputation: 63032

The following option may make sense if the amount of work in each mapper is substantial, since this strategy involves the overhead of reading up to 20 counters in each map invocation.

Create a group of counters and make the group name MY_TASK_MAPPERS. Make the keys MAPPER<1..K>, where K is the maximum number of mappers you want. Then, in the mapper, iterate through the counters until one of them is found to be 0. Place the machine's un-dotted IP address in that counter as a long value, effectively assigning that one machine to that mapper slot. If instead all K slots are already taken, just quit the mapper without doing anything. A sketch follows.
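Here is a minimal sketch of that idea. It assumes a Java mapper using the old mapred API rather than a streaming script, because a streaming mapper can only write counters (via stderr lines) and cannot read them back; it also assumes counter values are visible across tasks as described above. The class name, constant names, and placeholder map logic are illustrative, not part of any Hadoop API.

    import java.io.IOException;
    import java.net.InetAddress;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ThrottledMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private static final String GROUP = "MY_TASK_MAPPERS";
      private static final int MAX_MAPPERS = 20;  // the K described above

      private boolean checked = false;
      private boolean claimed = false;

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        if (!checked) {
          checked = true;
          // Pack this host's IPv4 address into a long (the "un-dotted" form).
          long myIp = 0;
          for (byte b : InetAddress.getLocalHost().getAddress()) {
            myIp = (myIp << 8) | (b & 0xffL);
          }
          // Scan the K slot counters and claim the first unused one.
          for (int i = 1; i <= MAX_MAPPERS; i++) {
            Counters.Counter slot = reporter.getCounter(GROUP, "MAPPER" + i);
            long owner = slot.getCounter();
            if (owner == 0) {
              slot.increment(myIp);  // 0 + myIp: marks the slot as ours
              claimed = true;
              break;
            } else if (owner == myIp) {
              claimed = true;        // this machine already holds a slot
              break;
            }
          }
        }
        if (!claimed) {
          return;  // all K slots taken: quit without doing anything
        }
        // ... the real map work goes here ...
        output.collect(new Text("processed"), value);
      }
    }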

Upvotes: 0

David Gruzman

Reputation: 8088

While I am not aware of any "node-wise" capacity scheduling, there is an alternative scheduler built for a very similar case: the Capacity Scheduler.

http://hadoop.apache.org/common/docs/r0.19.2/capacity_scheduler.html

You should define a special queue for potentially long jobs and a queue for short jobs, and this scheduler will take care to keep some capacity always available for each queue's jobs. A configuration sketch is below.
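As a rough sketch, the setup would look something like this. Property names follow the 0.19-era docs linked above and vary between Hadoop versions, so check your release; the queue names "long" and "short" and the capacity split are made up for this example.

    <!-- mapred-site.xml: enable the Capacity Scheduler and declare queues. -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
    </property>
    <property>
      <name>mapred.queue.names</name>
      <value>long,short</value>
    </property>

    <!-- conf/capacity-scheduler.xml: split the cluster's capacity
         between the two queues. -->
    <property>
      <name>mapred.capacity-scheduler.queue.long.guaranteed-capacity</name>
      <value>70</value>
    </property>
    <property>
      <name>mapred.capacity-scheduler.queue.short.guaranteed-capacity</name>
      <value>30</value>
    </property>

A streaming job is then submitted to a queue by setting mapred.job.queue.name, e.g. passing -D mapred.job.queue.name=short (or -jobconf on older releases) to the streaming command.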

Upvotes: 1
