chapstick

Reputation: 743

How to compute an average in MapReduce

Problem: we want to take the average of salaries stored in a text file. Assume that each line of the file contains firstname, lastname and salary. Let's say we want to do this for all companies of all sizes in the US. A new file is started for each new day, i.e. all salaries entered on April 29 go into a file titled April29.txt, all salaries entered on April 30 go into a file titled April30.txt, and so on. As you can imagine, the number of lines will be different from day to day.

Goal: Calculate the average salary per file using mapreduce.

Now, everywhere I look, the general suggestion for computing an average is this: the map function reads one line at a time and outputs ("key", value). Because there is only one key, "key", all output goes to ONE reducer, where we use a for loop to compute the average.
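For concreteness, here is roughly what that single-key approach looks like. This is just an illustrative sketch; the class names and the comma-separated salary parsing are assumptions, not code from any particular example I found.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NaiveAverage {

    public static class SalaryMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final Text KEY = new Text("key");

        @Override
        public void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // assuming each line is "firstname,lastname,salary"
            String[] fields = line.toString().split(",");
            ctx.write(KEY, new LongWritable(Long.parseLong(fields[2].trim())));
        }
    }

    public static class AverageReducer extends Reducer<Text, LongWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<LongWritable> salaries, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            long count = 0;
            for (LongWritable salary : salaries) {   // every single salary ends up in this one reducer
                sum += salary.get();
                count++;
            }
            ctx.write(key, new DoubleWritable((double) sum / count));
        }
    }
}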

This approach works, except that the bigger the file gets, the worse the computation time becomes. Is there a way to improve this? I haven't found an example that addresses this situation, but if you know of one, please share a link. Thanks in advance.

Upvotes: 2

Views: 2969

Answers (2)

kanishka vatsa

Reputation: 2314

I was curious whether we can do it using the counters provided by Hadoop. Say we build two counters like

public enum CountCounters { Counter }

public enum SumCounters { Counter }

From within the map method of our mapper, we can access the counters and increment them:

context.getCounter(CountCounters.Counter).increment(1);
context.getCounter(SumCounters.Counter).increment(salary);

and at the end of the job we will call

job.getCounters().findCounter(CountCounters.Counter).getValue();
job.getCounters().findCounter(SumCounters.Counter).getValue();

and compute the average from these two values.
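Putting the pieces together, a rough end-to-end sketch might look like this. The mapper class, the comma-separated salary parsing, and the map-only driver are assumptions added for illustration; only the counter enums and the counter calls come from the snippets above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageWithCounters {

    public enum CountCounters { Counter }
    public enum SumCounters { Counter }

    public static class SalaryMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // assuming each line is "firstname,lastname,salary"
            long salary = Long.parseLong(line.toString().split(",")[2].trim());
            // one counter counts records, the other accumulates the salary sum
            context.getCounter(CountCounters.Counter).increment(1);
            context.getCounter(SumCounters.Counter).increment(salary);
            // nothing is emitted; the counters carry everything we need
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "average via counters");
        job.setJarByClass(AverageWithCounters.class);
        job.setMapperClass(SalaryMapper.class);
        job.setNumReduceTasks(0);                      // map-only job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);

        long count = job.getCounters().findCounter(CountCounters.Counter).getValue();
        long sum = job.getCounters().findCounter(SumCounters.Counter).getValue();
        System.out.println("average salary = " + (double) sum / count);
    }
}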

Upvotes: 0

Mike Park

Reputation: 10931

This can definitely be done more efficiently.

Now, we know the Mapper has a map method you can override. However, it also has a cleanup method. Looking at the source of Mapper, you see this:

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}

So we can use this cleanup method to optimize our average code a bit.

First you need a custom writable that stores two things, count and sum. Let's call it AverageWritable. Then, we'll do something like this in the mapper:

AverageWritable avg = new AverageWritable();

@Override
public void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
    long salary = [ ... code to get salary... ]
    avg.addCount(1);
    avg.addSum(salary);
}

@Override
public void cleanup(Context ctx) throws IOException, InterruptedException {
    ctx.write(CONSTANT_KEY, avg);
}
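For reference, a minimal AverageWritable might look something like this. The addCount/addSum names match the mapper above; everything else is just one possible layout, not the author's actual class.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class AverageWritable implements Writable {
    private long count = 0;
    private long sum = 0;

    public void addCount(long c) { count += c; }
    public void addSum(long s)   { sum += s; }
    public long getCount()       { return count; }
    public long getSum()         { return sum; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(count);
        out.writeLong(sum);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        count = in.readLong();
        sum = in.readLong();
    }
}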

The reducer and combiner code should be easy enough to figure out from here.
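If it helps, here is one way the reducer could look. It assumes CONSTANT_KEY is a Text and that AverageWritable exposes the getCount/getSum accessors sketched above; it's a fragment in the same style as the mapper, with the usual org.apache.hadoop.io and org.apache.hadoop.mapreduce imports assumed. A combiner would be the same idea, except it would merge the partials into a single AverageWritable and emit that instead of the final average.

public static class AverageReducer extends Reducer<Text, AverageWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<AverageWritable> partials, Context ctx)
            throws IOException, InterruptedException {
        long count = 0;
        long sum = 0;
        for (AverageWritable partial : partials) {   // roughly one partial per map task
            count += partial.getCount();
            sum += partial.getSum();
        }
        ctx.write(key, new DoubleWritable((double) sum / count));
    }
}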

Upvotes: 3
