Reputation: 1023
In my Reducer code, I am using this code snippet to sum the values:
for (IntWritable val : values) {
    sum += val.get();
}
As the above gives me the expected output, I tried changing the code to:
for (IntWritable val : values) {
    sum += 1;
}
Can anyone please explain what difference it makes when I use sum += 1
in the reducer rather than sum += val.get()
? Why does it give me the same output? Does it have anything to do with the Combiner? When I used this same reducer class as the Combiner, the output was incorrect, with every word showing a count of 1.
Mapper Code:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer token = new StringTokenizer(line);
    while (token.hasMoreTokens()) {
        word.set(token.nextToken());
        context.write(word, new IntWritable(1));
    }
}
Reducer Code:
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        //sum += val.get();
        sum += 1;
    }
    context.write(key, new IntWritable(sum));
}
Driver Code:
job.setJarByClass(WordCountWithCombiner.class);
//job.setJobName("WordCount");
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
Input - "to be or not to be"
Expected Output - (be,2) , (to,2) , (or,1) , (not,1)
But the output I am getting is - (be,1) , (to,1) , (or,1) , (not,1)
Upvotes: 3
Views: 484
Reputation: 1496
It all depends on the values being summed by sum += val.get();
If val.get()
always returns 1, then sum += val.get();
is the same as sum += 1;
which is what is happening in your reducer.
BUT
The combiner is used to do a pre-aggregation (similar to the reducer's aggregation) on the mapper side, before the key-value pairs are sent to the reducer(s).
The Hadoop framework doesn't guarantee how many times the combiner is executed per Mapper; that depends on the number of map outputs. So if the combiner is executed even once, the aggregation on the mapper side will be fine, but the reducer, instead of receiving only 1s, could receive other numbers (val.get() >= 1
). If you use sum += 1;
in your reducer, you will be discarding the counts already aggregated on the mapper side, generating wrong output.
If the combiner is executed more than once on the mapper side, you can imagine the problem getting even worse.
In summary, sum += 1;
works if and only if that statement is executed exactly once per key-value pair. With a combiner, that is not guaranteed.
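The effect can be sketched without Hadoop at all (a minimal illustration with plain ints standing in for IntWritable, not the actual framework API): counting loop iterations only matches summing the values while every value is 1, and a combiner pass breaks exactly that assumption.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerEffect {
    // Like "sum += 1" in the question: counts iterations, ignores the values.
    static int countIterations(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += 1;
        return sum;
    }

    // Like "sum += val.get()" in the question: sums the actual values.
    static int sumValues(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // Without a combiner, the reducer for "to" sees the raw 1s:
        List<Integer> raw = Arrays.asList(1, 1);
        System.out.println(countIterations(raw)); // 2
        System.out.println(sumValues(raw));       // 2

        // With a combiner, it may instead see one pre-aggregated value:
        List<Integer> combined = Arrays.asList(2);
        System.out.println(countIterations(combined)); // 1 -- wrong count
        System.out.println(sumValues(combined));       // 2 -- still correct
    }
}
```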
Upvotes: 1
Reputation: 13402
Can anyone please explain what difference it makes when I use
sum += 1
in the reducer rather than sum += val.get()
?
Both statements perform an addition. With the first, you are counting how many times the for-loop
has run. With the latter, you are actually summing the int
values returned by each val
object for a given key
.
Why does it give me the same output? Does it have anything to do with the Combiner?
The answer is yes. It is because of the Combiner
.
Now let's look at the input you are passing; it will instantiate only one Mapper
. The output of the Mapper
is:
(to,1), (be,1), (or,1), (not,1), (to,1), (be,1)
This goes to the Combiner
, which runs essentially the same logic as the Reducer
, so its output will be:
(be,2) , (to,2) , (or,1) , (not,1)
Now the above output of the Combiner
goes to the Reducer
, which performs the sum operation however you define it. So if your logic is sum += 1
then the output will be:
(be,1) , (to,1) , (or,1) , (not,1)
But if your logic is sum += val.get()
then your output will be:
(be,2) , (to,2) , (or,1) , (not,1)
I hope you understand it now. The logic of the Combiner
and Reducer
is the same, but the inputs that come to them for processing are different.
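The two passes above can be simulated in plain Java (a sketch of the data flow only; the class and helper names are hypothetical, not Hadoop classes). The same summing logic is applied once as the "combiner" and again as the "reducer", and the counts stay correct because it sums values instead of counting them:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PipelineSketch {
    // The reducer/combiner logic: sum the values for each key (sum += val.get()).
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // Mapper output for "to be or not to be": (word, 1) pairs, grouped by key.
        Map<String, List<Integer>> mapOut = new TreeMap<>();
        for (String w : "to be or not to be".split(" "))
            mapOut.computeIfAbsent(w, k -> new ArrayList<>()).add(1);

        // "Combiner" pass pre-aggregates on the mapper side.
        Map<String, Integer> combined = reduce(mapOut);
        System.out.println(combined); // {be=2, not=1, or=1, to=2}

        // "Reducer" pass sees the combiner's sums as its input values.
        Map<String, List<Integer>> reduceIn = new TreeMap<>();
        combined.forEach((k, v) ->
            reduceIn.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        System.out.println(reduce(reduceIn)); // {be=2, not=1, or=1, to=2}
    }
}
```

Replacing the inner loop with sum += 1 would make the second pass report 1 for every word, reproducing the incorrect output from the question.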
Upvotes: 1
Reputation: 177
val.get()
returns an int
so basically both versions of the code are the same here. Whether to use val.get() depends on the problem you are trying to solve. In your case, the mapper emits each word as the key with a value of 1, so in the reducer you can be sure that val.get() will always return 1. Hence the hard-coded integer value 1 gives the same result.
Also, using the same reducer as the combiner function should not by itself cause any problem. One scenario where the output would show every word with a count of 1 is when the number of reducers is set to 0 and the mapper output is written directly to the output path.
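For reference, the map-only scenario described above corresponds to one extra line in the driver code from the question (a fragment extending that snippet, not a complete driver): with zero reduce tasks, neither the combiner nor the reducer runs, so the raw (word,1) pairs from the mapper land straight in the output path.

```java
// With zero reduce tasks the job is map-only: the combiner and reducer
// are skipped, and each mapper's output is written directly to HDFS.
job.setNumReduceTasks(0);
```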
Upvotes: 0