Reputation: 91959

Finding Top-K records in a dataset

In an attempt to learn Hadoop, I am practicing the unsolved programming questions from the book "Hadoop in Action".

Dataset Sample:

3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,,
3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,,
3070803,1963,1096,,"US","IL",,1,,2,6,63,,9,,0.3704,,,,,,,
3070804,1963,1096,,"US","OH",,1,,2,6,63,,3,,0.6667,,,,,,,
3070805,1963,1096,,"US","CA",,1,,2,6,63,,1,,0,,,,,,,
3070806,1963,1096,,"US","PA",,1,,2,6,63,,0,,,,,,,,,
3070807,1963,1096,,"US","OH",,1,,623,3,39,,3,,0.4444,,,,,,,
3070808,1963,1096,,"US","IA",,1,,623,3,39,,4,,0.375,,,,,,,
3070809,1963,1096,,"US","AZ",,1,,4,6,65,,0,,,,,,,,,
3070810,1963,1096,,"US","IL",,1,,4,6,65,,3,,0.4444,,,,,,,

Map Function

public static class MapClass extends MapReduceBase implements Mapper<Text, Text, IntWritable, Text> {
    private int maxClaimCount = 0;
    private Text record = new Text();

    public void map(Text key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
        String claim = value.toString().split(",")[7];
        //if (!claim.isEmpty() && claim.matches("\\d")) {
        if (!claim.isEmpty()) {
            int claimCount = Integer.parseInt(claim);
            if (claimCount > maxClaimCount) {
                maxClaimCount = claimCount;
                record = value;
                output.collect(new IntWritable(claimCount), value);
            }
            // output.collect(new IntWritable(claimCount), value);
        }
    }
}

Reduce Function

public static class Reduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text> {

    public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
        output.collect(key, values.next());
    }
}

Command to Run:

hadoop jar ~/Desktop/wc.jar com/hadoop/patent/TopKRecords -Dmapred.map.tasks=7 ~/input  ~/output

Requirement:
- Based on the value in the ninth column, find the top-K records (say, K = 7) in the dataset
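One detail that may be worth checking against this requirement: Java arrays are zero-based, so split(",")[7] reads the eighth field, while the ninth column is index 8. The following is only a small sketch in plain Java (no Hadoop needed); the class name ColumnIndex and the helper column are hypothetical names introduced for illustration.

```java
class ColumnIndex {
    // Returns the n-th column (1-based) of a comma-separated record.
    // The limit -1 keeps trailing empty fields, which a plain split(",") drops.
    static String column(String record, int n) {
        String[] fields = record.split(",", -1);
        return fields[n - 1];
    }

    public static void main(String[] args) {
        // First record of the dataset sample.
        String record = "3070801,1963,1096,,\"BE\",\"\",,1,,269,6,69,,1,,0,,,,,,,";
        System.out.println("column 8: '" + column(record, 8) + "'");
        System.out.println("column 9: '" + column(record, 9) + "'");
    }
}
```

For that first sample record, the eighth column holds 1 and the ninth column is empty, so which index is used changes which records survive the isEmpty() check.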

Question:
- Since only the top 7 records are needed, I run seven map tasks and, in each one, track the record with the highest claim count in maxClaimCount and record
- I do not know how to emit only that maximum record, so that each map task produces exactly one output

How do I do that?

Upvotes: 0

Answers (3)

Ashok Kumar

Reputation: 3

You can use the top-K design pattern. For more details, refer to the blog post "Finding Top K records in Mapreduce".

Upvotes: 0

twid

Reputation: 6686

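You can use a TreeMap, which keeps its keys sorted. With the newer org.apache.hadoop.mapreduce API (plus java.util.TreeMap and java.util.Map), the mapper would be something like:

public static class TopKMapper extends Mapper<Object, Text, IntWritable, Text> {
    // Claim count -> record, sorted ascending. Records with equal counts overwrite each other.
    private final TreeMap<Integer, Text> top = new TreeMap<Integer, Text>();

    @Override
    public void map(Object key, Text value, Context context) {
        String claim = value.toString().split(",")[7]; // same column as in the question
        if (!claim.isEmpty()) {
            top.put(Integer.parseInt(claim), new Text(value)); // copy: Hadoop reuses the value object
            if (top.size() > 7) {
                top.pollFirstEntry(); // drop the smallest count
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit this mapper's top 7 once all of its input has been seen.
        for (Map.Entry<Integer, Text> entry : top.entrySet()) {
            context.write(new IntWritable(entry.getKey()), entry.getValue());
        }
    }
}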
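To see how the per-mapper results would combine, here is a rough sketch in plain Java (no Hadoop classes) of the same bounded-TreeMap idea on both sides: each "mapper" keeps its local top K, and a single "reducer" merges those partial results. The class LocalTopK, its method names, and the two-field records are all made up for illustration, and, as with any TreeMap keyed by count, records with equal counts overwrite each other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LocalTopK {
    // Inserts one (count, record) pair and evicts the smallest count
    // once more than k entries are held.
    static void addBounded(TreeMap<Integer, String> top, int count, String record, int k) {
        top.put(count, record);
        if (top.size() > k) {
            top.pollFirstEntry();
        }
    }

    // Mapper side: top k of one input split. column is a zero-based field index.
    static TreeMap<Integer, String> localTopK(List<String> split, int column, int k) {
        TreeMap<Integer, String> top = new TreeMap<>();
        for (String record : split) {
            String[] fields = record.split(",", -1);
            // Skip records whose field is missing, empty, or not a plain number.
            if (column < fields.length && fields[column].matches("\\d+")) {
                addBounded(top, Integer.parseInt(fields[column]), record, k);
            }
        }
        return top;
    }

    // Reducer side: merge the per-split results into the global top k, largest first.
    static List<String> merge(List<TreeMap<Integer, String>> partials, int k) {
        TreeMap<Integer, String> top = new TreeMap<>();
        for (TreeMap<Integer, String> partial : partials) {
            for (Map.Entry<Integer, String> entry : partial.entrySet()) {
                addBounded(top, entry.getKey(), entry.getValue(), k);
            }
        }
        return new ArrayList<>(top.descendingMap().values());
    }

    public static void main(String[] args) {
        // Made-up two-field records standing in for two input splits.
        List<String> splitA = Arrays.asList("a,5", "b,9", "c,1");
        List<String> splitB = Arrays.asList("d,7", "e,3");
        List<TreeMap<Integer, String>> partials = new ArrayList<>();
        partials.add(localTopK(splitA, 1, 2));
        partials.add(localTopK(splitB, 1, 2));
        System.out.println(merge(partials, 2));
    }
}
```

With K = 2 this prints [b,9, d,7]. The merge only gives a global answer if all partial results reach the same reducer, so in a real job this would presumably run with one reduce task.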

Upvotes: 0

Alex Gitelman

Reputation: 24722

This is an updated answer. The existing comments do not apply to it, as they were based on the original (incorrect) answer.

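The mapper should only emit

output.collect(new IntWritable(claimCount), value);

without any comparison. The map output is sorted by claim count before it is passed to the reducer.

In the reducer, use a priority queue to pick the top 7 results. This assumes the job runs with a single reducer, so that every record passes through the same queue.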
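As a rough illustration of the priority-queue step, here is a sketch in plain Java rather than a real Reducer: a min-heap is capped at K entries, so the smallest count is evicted whenever the heap grows past K. The class name TopKQueue, the method topK, and the sample records are hypothetical. Unlike a TreeMap keyed by count, this keeps records that share the same count.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopKQueue {
    // Returns the k entries with the largest counts, largest first.
    static List<Map.Entry<Integer, String>> topK(List<Map.Entry<Integer, String>> entries, int k) {
        Comparator<Map.Entry<Integer, String>> byCount = Map.Entry.comparingByKey();
        // Min-heap: the head is always the smallest count currently kept.
        PriorityQueue<Map.Entry<Integer, String>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<Integer, String> entry : entries) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // evict the smallest count
            }
        }
        // Draining the heap yields ascending order, so reverse it at the end.
        List<Map.Entry<Integer, String>> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        // Made-up (claim count, record) pairs.
        List<Map.Entry<Integer, String>> entries = new ArrayList<>();
        entries.add(new SimpleEntry<Integer, String>(3, "record-a"));
        entries.add(new SimpleEntry<Integer, String>(9, "record-b"));
        entries.add(new SimpleEntry<Integer, String>(9, "record-c"));
        entries.add(new SimpleEntry<Integer, String>(1, "record-d"));
        entries.add(new SimpleEntry<Integer, String>(5, "record-e"));
        for (Map.Entry<Integer, String> entry : topK(entries, 3)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

In an actual reducer the same loop would run over the incoming keys and values, with the heap emitted once the input is exhausted. Memory use stays proportional to K rather than to the size of the dataset.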

Upvotes: 3
