Reputation: 91959
In an attempt to learn Hadoop, I am working through the unsolved programming exercises from the book "Hadoop in Action".
Dataset Sample:
3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,,
3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,,
3070803,1963,1096,,"US","IL",,1,,2,6,63,,9,,0.3704,,,,,,,
3070804,1963,1096,,"US","OH",,1,,2,6,63,,3,,0.6667,,,,,,,
3070805,1963,1096,,"US","CA",,1,,2,6,63,,1,,0,,,,,,,
3070806,1963,1096,,"US","PA",,1,,2,6,63,,0,,,,,,,,,
3070807,1963,1096,,"US","OH",,1,,623,3,39,,3,,0.4444,,,,,,,
3070808,1963,1096,,"US","IA",,1,,623,3,39,,4,,0.375,,,,,,,
3070809,1963,1096,,"US","AZ",,1,,4,6,65,,0,,,,,,,,,
3070810,1963,1096,,"US","IL",,1,,4,6,65,,3,,0.4444,,,,,,,
Map Function
public static class MapClass extends MapReduceBase implements Mapper<Text, Text, IntWritable, Text> {
    private int maxClaimCount = 0;
    private Text record = new Text();

    public void map(Text key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
        String claim = value.toString().split(",")[7];
        //if (!claim.isEmpty() && claim.matches("\\d")) {
        if (!claim.isEmpty()) {
            int claimCount = Integer.parseInt(claim);
            if (claimCount > maxClaimCount) {
                maxClaimCount = claimCount;
                record = value;
                output.collect(new IntWritable(claimCount), value);
            }
            // output.collect(new IntWritable(claimCount), value);
        }
    }
}
Reduce Function
public static class Reduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
        output.collect(key, values.next());
    }
}
Command to Run:
hadoop jar ~/Desktop/wc.jar com/hadoop/patent/TopKRecords -Dmapred.map.tasks=7 ~/input ~/output
Requirement:
- Based on the value in the ninth column, find the top-K records (say, K = 7) in the dataset
Question:
- Since only the top 7 records are needed, I run seven map tasks and make sure each one keeps the record with the highest count in maxClaimCount and record
- I do not know how to collect only that maximum record, so that each map task emits just one output
How do I do that?
Upvotes: 0
Views: 1306
Reputation: 3
You can use the top-K design pattern. For more details, refer to the blog post "Finding Top K Records in MapReduce".
Upvotes: 0
Reputation: 6686
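You can use a TreeMap, which keeps its keys in sorted order. Each mapper keeps only its 7 largest records and emits them in cleanup(). A sketch of the mapper (using the Context-based org.apache.hadoop.mapreduce API):
public static class TopKMapper extends Mapper<Object, Text, IntWritable, Text> {
    // Keyed by claim count; records with equal counts overwrite each other.
    private final TreeMap<Integer, Text> top = new TreeMap<Integer, Text>();

    @Override
    public void map(Object key, Text value, Context context) {
        String claim = value.toString().split(",")[7]; // same column as in the question
        if (claim.isEmpty()) {
            return;
        }
        // Copy the value: Hadoop reuses the same Text instance between calls.
        top.put(Integer.parseInt(claim), new Text(value));
        if (top.size() > 7) {
            top.remove(top.firstKey()); // drop the smallest count
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<Integer, Text> entry : top.entrySet()) {
            context.write(new IntWritable(entry.getKey()), entry.getValue());
        }
    }
}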
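The eviction idea can be tried outside Hadoop first. The following is only a minimal sketch in plain Java: the class name TreeMapTopK, the method topK, and the sample counts are made up for illustration and are not part of any Hadoop API. It also shows one caveat of keying the map by count: two records with the same count cannot both be kept.

```java
import java.util.TreeMap;

final class TreeMapTopK {
    /** Keeps at most k entries, always evicting the entry with the smallest count. */
    static TreeMap<Integer, String> topK(int[] counts, String[] records, int k) {
        TreeMap<Integer, String> top = new TreeMap<Integer, String>();
        for (int i = 0; i < counts.length; i++) {
            top.put(counts[i], records[i]);   // an equal count replaces the earlier record
            if (top.size() > k) {
                top.pollFirstEntry();         // remove the smallest key
            }
        }
        return top;
    }

    public static void main(String[] args) {
        // Illustrative counts only, not taken from the patent dataset.
        int[] counts = {5, 12, 7, 3, 9};
        String[] records = {"a", "b", "c", "d", "e"};
        // Iteration order of a TreeMap is ascending by key.
        System.out.println(topK(counts, records, 3));
    }
}
```

If ties matter, a structure that allows duplicate counts (for example a priority queue of count/record pairs) avoids the overwrite.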
Upvotes: 0
Reputation: 24722
This is an updated answer. The existing comments do not apply to it, because they refer to the original (incorrect) answer.
The mapper should simply emit
output.collect(new IntWritable(claimCount), value);
without any comparison. The framework sorts the mapper output by claim count before passing it to the reducer.
In the reducer, use a priority queue to pick the top 7 results. Run a single reducer so that it sees every record.
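The priority-queue selection could look roughly like the sketch below. It is written in plain Java without Hadoop types so that it runs on its own; the names TopKSelector, Scored, and topK are hypothetical, and the sample counts are invented. In a real job, the loop body would run once per value received by the reducer, and the final list would be emitted when the reducer finishes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class TopKSelector {
    /** A (claimCount, record) pair; hypothetical helper type for this sketch. */
    static final class Scored implements Comparable<Scored> {
        final int claimCount;
        final String record;

        Scored(int claimCount, String record) {
            this.claimCount = claimCount;
            this.record = record;
        }

        @Override
        public int compareTo(Scored other) {
            return Integer.compare(claimCount, other.claimCount);
        }
    }

    /** Returns the k entries with the largest claim counts, largest first. */
    static List<Scored> topK(Iterable<Scored> input, int k) {
        // Min-heap on claimCount: the head is always the smallest entry kept so far.
        PriorityQueue<Scored> heap = new PriorityQueue<Scored>();
        for (Scored s : input) {
            heap.add(s);
            if (heap.size() > k) {
                heap.poll(); // evict the current smallest
            }
        }
        List<Scored> result = new ArrayList<Scored>(heap);
        Collections.sort(result, Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        // Illustrative counts only, not taken from the patent dataset.
        List<Scored> sample = new ArrayList<Scored>();
        sample.add(new Scored(5, "a"));
        sample.add(new Scored(12, "b"));
        sample.add(new Scored(7, "c"));
        sample.add(new Scored(3, "d"));
        sample.add(new Scored(9, "e"));
        for (Scored s : topK(sample, 3)) {
            System.out.println(s.claimCount + "\t" + s.record);
        }
    }
}
```

Because the heap never holds more than k + 1 entries, memory stays bounded regardless of input size, and unlike a map keyed by count, records with equal counts can coexist.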
Upvotes: 3