Reputation: 743
Problem: we want to take the average of salaries stored in a text file. Assume that each line of the file contains firstname, lastname and salary. Let's say we want to do this for all companies, of all sizes, in the US. A new file is started for each new day, i.e. all salaries entered on April 29 are in a file titled April29.txt, all salaries entered on April 30 are in a file titled April30.txt, and so on. You can imagine the number of lines will differ from day to day.
Goal: Calculate the average salary per file using mapreduce.
Now, everywhere I look, the general suggestion for computing an average is this: map reads one line at a time and outputs ("key", value). Because there is only one key, "key", all output goes to ONE reducer, where we use a for loop to compute the average.
This approach works, except that the bigger the file gets, the worse the computation time becomes. Is there a way to improve this situation? I didn't find an example that addresses it, but if you know of one, please share a link. Thanks in advance.
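For reference, the single-key approach described above would look roughly like this (a sketch only; the class names are made up, and it assumes the salary is the third comma-separated field on each line):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Every record is emitted under the same key, so a single reducer sees all of them.
class NaiveAverageMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final Text KEY = new Text("key");

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // line format: firstname,lastname,salary
        long salary = Long.parseLong(line.toString().split(",")[2].trim());
        ctx.write(KEY, new LongWritable(salary));
    }
}

class NaiveAverageReducer extends Reducer<Text, LongWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> salaries, Context ctx)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        // the one reducer loops over every single salary in the file
        for (LongWritable s : salaries) {
            sum += s.get();
            count++;
        }
        ctx.write(key, new DoubleWritable((double) sum / count));
    }
}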
Upvotes: 2
Views: 2969
Reputation: 2314
I was curious whether we can do this using the counters provided by Hadoop. Say we define two counters, like
public enum CountCounters { Counter }
public enum SumCounters { Counter }
From within the map method of our mapper, we can access these counters and increment them:
context.getCounter(CountCounters.Counter).increment(1);
context.getCounter(SumCounters.Counter).increment(salary);
and at the end, in the driver, we read them back:
job.getCounters().findCounter(CountCounters.Counter).getValue();
job.getCounters().findCounter(SumCounters.Counter).getValue();
and compute the average from these two values.
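Putting it together, a minimal sketch might look like this (the mapper class name and the salary-parsing line are illustrative assumptions; the enums are the ones declared above, and the job runs map-only):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class CounterAverageMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    public enum CountCounters { Counter }
    public enum SumCounters { Counter }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
        // line format: firstname,lastname,salary
        long salary = Long.parseLong(line.toString().split(",")[2].trim());
        ctx.getCounter(CountCounters.Counter).increment(1);
        ctx.getCounter(SumCounters.Counter).increment(salary);
    }
}

// In the driver (with job.setNumReduceTasks(0)), after job.waitForCompletion(true) returns:
// long count = job.getCounters().findCounter(CounterAverageMapper.CountCounters.Counter).getValue();
// long sum   = job.getCounters().findCounter(CounterAverageMapper.SumCounters.Counter).getValue();
// double average = (double) sum / count;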
Upvotes: 0
Reputation: 10931
This can definitely be done more efficiently.
Now, we know the Mapper has a map method you can override. However, it also has a cleanup method. Looking at the source of the mapper, you see this:
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
So we can use this cleanup method to optimize our average code a bit.
First, you need a custom writable that stores two things: a count and a sum. Let's call it AverageWritable (a minimal sketch of such a writable is included after the mapper code below). Then, we'll do something like this in the mapper:
private final AverageWritable avg = new AverageWritable();

@Override
public void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
    // line format: firstname,lastname,salary
    long salary = Long.parseLong(value.toString().split(",")[2].trim());
    avg.addCount(1);      // accumulate locally instead of emitting one record per line
    avg.addSum(salary);
}

@Override
public void cleanup(Context ctx) throws IOException, InterruptedException {
    // emit a single (count, sum) pair per map task
    ctx.write(CONSTANT_KEY, avg);
}
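For completeness, here is a minimal sketch of what AverageWritable could look like (the field and method names are just one possible choice, matching the addCount/addSum calls above):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

class AverageWritable implements Writable {
    private long count;
    private long sum;

    public void addCount(long c) { count += c; }
    public void addSum(long s)   { sum += s; }

    public long getCount() { return count; }
    public long getSum()   { return sum; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(count);
        out.writeLong(sum);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        count = in.readLong();
        sum = in.readLong();
    }
}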
The reducer and combiner code should be easy enough to figure out from here.
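For instance, a combiner/reducer pair along these lines would work (a sketch, assuming the getters from the AverageWritable above and a Text CONSTANT_KEY):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner: merge partial (count, sum) pairs so less data crosses the network.
class AverageCombiner extends Reducer<Text, AverageWritable, Text, AverageWritable> {
    @Override
    protected void reduce(Text key, Iterable<AverageWritable> partials, Context ctx)
            throws IOException, InterruptedException {
        AverageWritable merged = new AverageWritable();
        for (AverageWritable p : partials) {
            merged.addCount(p.getCount());
            merged.addSum(p.getSum());
        }
        ctx.write(key, merged);
    }
}

// Reducer: same merge, with the single division done at the very end.
class AverageReducer extends Reducer<Text, AverageWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<AverageWritable> partials, Context ctx)
            throws IOException, InterruptedException {
        long count = 0, sum = 0;
        for (AverageWritable p : partials) {
            count += p.getCount();
            sum += p.getSum();
        }
        ctx.write(key, new DoubleWritable((double) sum / count));
    }
}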
Upvotes: 3