Sahil

Reputation: 9488

Clarification regarding this map reduce word count example?

I am studying MapReduce, and I have a question about the basic word count example. Say my text is

My name is X Y X.

Here is the map class I am referring to:

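    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Nested inside the job's outer class (hence "static").
    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      // Reused for every emitted pair to avoid allocating new objects per token.
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        // Emit (word, 1) for every whitespace-separated token in the line.
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }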

When the text is processed by this map job, it will result in

My 1
name 1
is 1
X 1
Y 1
X 1     

Then, after shuffle and sort, all pairs with the same key are grouped together and their values can be added to get the final count. In this example, the counts for the two X's will be added.

My question is: what if I do the addition in the map job itself, by keeping a map of word to count, then iterating over that map and writing each count to the output? Will it have an impact on the MapReduce job? The output will still be the same. However, will it be more efficient to do it like that, since there will be fewer entries for the shuffle, sort, and reducer to operate on?

Is my thinking of doing the addition in the map job correct?

Upvotes: 2

Views: 73

Answers (1)

Vlad

Reputation: 9481

Yes, you should keep your map output as small as possible. Doing a preliminary count will reduce the amount of data moving through the system. Note that you still need a reduce step that adds the counts for each word: your input could be split at Y, so the two "X" words would go to different mappers.
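To illustrate, here is a minimal plain-Java sketch that simulates the idea without any Hadoop dependency. The class and method names (`InMapperCombiningSketch`, `mapWithLocalCounts`, `reduce`) are made up for this example, and the split boundary after "Y" is an assumption chosen to show why the reducer is still required.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java simulation (not Hadoop code) of counting inside the mapper.
class InMapperCombiningSketch {

    // Simulates one map task: counts words locally and returns one
    // (word, partialCount) pair per distinct word in its split.
    static Map<String, Integer> mapWithLocalCounts(String split) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    // Simulates shuffle + reduce: sums the partial counts of each word
    // across the outputs of all map tasks.
    static Map<String, Integer> reduce(List<Map<String, Integer>> mapOutputs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> output : mapOutputs) {
            output.forEach((word, count) -> totals.merge(word, count, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical split after "Y": the two X's land in different map tasks,
        // so neither task alone knows the full count for X.
        List<Map<String, Integer>> mapOutputs = new ArrayList<>();
        mapOutputs.add(mapWithLocalCounts("My name is X Y"));
        mapOutputs.add(mapWithLocalCounts("X"));
        System.out.println(reduce(mapOutputs));
    }
}
```

In a real Hadoop mapper the local map would typically be flushed to the output when the task finishes, and its memory use grows with the number of distinct words in the split, so this is a trade-off rather than a free win.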

Another good way to make your MapReduce job more efficient is to use Combiners. These are reduce steps that run on the mapper node right after the map step completes, so you can shrink your map output even further before it is sent over the network.
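As a rough sketch of what a combiner does, the plain-Java simulation below (again not Hadoop code; `CombinerSketch`, `map`, and `sumByKey` are hypothetical names) runs the same summing function twice: once on a single map task's raw output, and once more on the merged output of all tasks.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java simulation (not Hadoop code) of a combiner for word count.
class CombinerSketch {

    // Simulates the unmodified mapper: emits (word, 1) for every token.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Sums values per key. Because addition is associative and commutative,
    // the same function can serve as both combiner and reducer.
    static List<Map.Entry<String, Integer>> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }

    public static void main(String[] args) {
        // Combiner step: applied to each map task's output on the mapper node.
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        shuffled.addAll(sumByKey(map("My name is X Y")));
        shuffled.addAll(sumByKey(map("X X")));
        // Reduce step: applied to the combined pairs from all map tasks.
        System.out.println(sumByKey(shuffled));
    }
}
```

With the old `org.apache.hadoop.mapred` API used in the question, a combiner is registered on the job configuration with `JobConf.setCombinerClass(...)`, and for word count the reducer class itself is commonly passed. Hadoop treats the combiner as an optimization and may run it zero or more times, so the reducer must still produce correct totals without it.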

Upvotes: 1
