Saurabh Pant

Reputation: 73

Setting number of Reduce tasks using command line

I am a beginner in Hadoop. When I try to set the number of reducers on the command line through the GenericOptionsParser, the number of reducers does not change. There is no property for the number of reducers set in the configuration file "mapred-site.xml", which I believe leaves the default of 1 reducer. I am using the Cloudera QuickStart VM with Hadoop version "Hadoop 2.5.0-cdh5.2.0". Pointers appreciated. I would also like to know the order of precedence among the ways to set the number of reducers:

  1. Using configuration File "mapred-site.xml"

    mapred.reduce.tasks

  2. By specifying in the driver class

    job.setNumReduceTasks(4)

  3. By specifying at the command line using Tool interface:

    -Dmapreduce.job.reduces=2
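
For completeness, option 1 would look something like the snippet below in `mapred-site.xml` (a sketch, not taken from my setup; note that `mapreduce.job.reduces` is the Hadoop 2.x name for the property, while `mapred.reduce.tasks` is its deprecated 1.x equivalent):

```xml
<!-- Hypothetical mapred-site.xml fragment: sets a cluster-wide default of 2 reducers -->
<property>
  <name>mapreduce.job.reduces</name>
  <value>2</value>
</property>
```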

Mapper :

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;


public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{   
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        String line = value.toString();

        //Split the line into words
        for(String word: line.split("\\W+"))
        {
            //Make sure that the word is legitimate
            if(word.length() > 0)
            {
                //Emit the word as you see it
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
}

Reducer :

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;


public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        //Initializing the word count to 0 for every key
        int count=0;

        for(IntWritable value: values)
        {
            //Adding the word count counter to count
            count += value.get();
        }

        //Finally write the word and its count
        context.write(key, new IntWritable(count));
    }
}

Driver :

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class WordCount extends Configured implements Tool 
{
    public int run(String[] args) throws Exception
    {
         //Instantiate the job object for configuring your job
        Job job = new Job();

        //Specify the class that hadoop needs to look in the JAR file
        //This Jar file is then sent to all the machines in the cluster
        job.setJarByClass(WordCount.class);

        //Set a meaningful name to the job
        job.setJobName("Word Count");

        //Set the path from which the input is to be read
        FileInputFormat.addInputPath(job, new Path(args[0]));

        //Set the path where the output must be stored
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        //Set the Mapper and the Reducer class
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        //Set the types of the output key and value of the Mapper and Reducer
        /*
         * If the Mapper output types differ from the Reducer output types,
         * also call setMapOutputKeyClass() and setMapOutputValueClass()
         */
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //job.setNumReduceTasks(4);

        //Start the job, wait for it to finish, and report success or failure
        //through the exit code
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception 
    {
        // Let ToolRunner handle generic command-line options 
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);

        System.exit(res);
    }
}

I have tried the following commands to run the job:

hadoop jar /home/cloudera/Misc/wordCount.jar WordCount -Dmapreduce.job.reduces=2 hdfs:/Input/inputdata hdfs:/Output/wordcount_tool_D=2_take13

and

hadoop jar /home/cloudera/Misc/wordCount.jar WordCount -D mapreduce.job.reduces=2 hdfs:/Input/inputdata hdfs:/Output/wordcount_tool_D=2_take14

Upvotes: 1

Views: 3914

Answers (1)

NishM

Reputation: 1726

Answering your query about the order: it is always 2 > 3 > 1.

The value set in your driver class takes precedence over the one you pass as an argument to the GenericOptionsParser, which in turn takes precedence over the one in your site-specific configuration file.

I would recommend debugging by printing out the configuration inside your driver class before you submit the job. That way you can be sure exactly what the configuration is right before the job goes to the cluster:

Configuration conf = getConf(); // Available to you since you extended Configured
for (Map.Entry<String, String> entry : conf) {
    System.out.println(entry.getKey() + " = " + entry.getValue());
}

Upvotes: 1
