samshers

Reputation: 3660

Controlling the number of map and reduce tasks spawned?

I am trying to understand how many map and reduce tasks get started for a job, and how to control their number.

Say I have a 1 TB file in HDFS and my block size is 128 MB. For an MR job on this 1 TB file, if I specify the input split size as 256 MB, how many map and reduce tasks get started? From my understanding this depends on the split size, i.e. number of map tasks = total file size / split size, which here works out to 1024 * 1024 MB / 256 MB = 4096. So Hadoop starts 4096 map tasks.
1) Am I right?

2) If I think this number is inappropriate, can I tell Hadoop to start fewer (or more) tasks? If yes, how?

And what about the number of reduce tasks spawned? I think this is entirely controlled by the user.
3) How and where do I specify the number of reduce tasks required?

Upvotes: 1

Views: 904

Answers (1)

Anurag Yadav

Reputation: 396

1. Yes, you're right. Number of mappers = (size of data) / (input split size), so in your case it works out to 4096.

2. As per my understanding, before Hadoop 2.7 you could only hint at the mapper count with conf.setNumMapTasks(int num); the actual number of mappers is still decided by the framework from the input splits. From Hadoop 2.7 onwards you can cap the number of simultaneously running map tasks with mapreduce.job.running.map.limit. See this JIRA ticket.

3. By default the number of reducers is 1. You can change it with job.setNumReduceTasks(int num);

You can also provide this parameter from the CLI with -Dmapred.reduce.tasks=<num reduce tasks>.
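To pull this together, here is a minimal, illustrative driver sketch. The class name, the 256 MB split size, and the reducer count of 16 are made-up example values; the calls shown (FileInputFormat.setMinInputSplitSize / setMaxInputSplitSize, Job.setNumReduceTasks) are the standard org.apache.hadoop.mapreduce API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-size-demo");
            job.setJarByClass(SplitSizeDemo.class);

            // Mappers are not set directly: the framework creates one map task
            // per input split. Forcing the min/max split size to 256 MB yields
            // roughly 4096 splits for a 1 TB input.
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

            // Reducers ARE set directly by the user (default is 1).
            job.setNumReduceTasks(16); // 16 is an arbitrary example value

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // No mapper/reducer classes are set, so the identity defaults
            // apply; this only demonstrates the configuration knobs.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that -D options on the command line are only parsed automatically if the driver runs through ToolRunner/GenericOptionsParser, and the current property name for the reducer count is mapreduce.job.reduces (mapred.reduce.tasks is the old, deprecated name).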

Upvotes: 2
