Reputation: 59
I need to find the student with the maximum marks using MapReduce (MR). Sample input:
Paul 90
Ben 20
Cook 80
Joe 85
So the output of the reducer should be (Paul 90).
Can anyone help me with this?
Upvotes: 0
Views: 1394
Reputation: 7279
You can map all input tuples to the same key, with each value being the input tuple itself, e.g. (the-one-key, (Ben, 20)), and use a reduce function that returns only the tuple with the maximum grade. Since there is only one key, a single reduce call sees every tuple.
To make sure that MR parallelism kicks in, using a combiner with the same function as the reducer (above) should do the trick. That way, the reducer will only get one tuple from each mapper and will have less work to do.
Edit: even better, you can already eliminate all but the max in the mapping function to get best performance (see Venkat's remark that combiners are not guaranteed to be used).
Example with two mappers:
Paul 90
Ben 20
Cook 80
Joe 85
Mapped to:
Mapper 1
(the-one-key, (Paul, 90))
(the-one-key, (Ben, 20))
Mapper 2
(the-one-key, (Cook, 80))
(the-one-key, (Joe, 85))
Combined to (still on the mappers' side):
Mapper 1
(the-one-key, (Paul, 90))
Mapper 2
(the-one-key, (Joe, 85))
Reduced to:
(the-one-key, (Paul, 90))
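The flow above can be simulated in plain Python, without Hadoop, as a rough sketch. The names `map_record`, `pick_max`, `run` and `THE_ONE_KEY` are made up for illustration and are not Hadoop APIs; `pick_max` plays the role of both the combiner and the reducer, and each inner list stands in for one mapper's input split.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, int]        # (name, marks)
THE_ONE_KEY = "the-one-key"     # hypothetical constant key shared by all records


def map_record(line: str) -> Tuple[str, Record]:
    """Map one input line such as 'Paul 90' to (the-one-key, (name, marks))."""
    name, marks = line.split()
    return THE_ONE_KEY, (name, int(marks))


def pick_max(key: str, values: Iterable[Record]) -> Tuple[str, Record]:
    """Combiner and reducer: keep only the tuple with the highest marks."""
    return key, max(values, key=lambda record: record[1])


def run(splits: List[List[str]]) -> Tuple[str, Record]:
    """Simulate one mapper per split, a combiner per mapper, and one reducer."""
    combined = []
    for split in splits:
        mapped = [map_record(line) for line in split]
        # The combiner runs on the mapper's side, so one tuple leaves each mapper.
        _, best = pick_max(THE_ONE_KEY, [value for _, value in mapped])
        combined.append(best)
    # The single reducer sees only the per-mapper winners.
    return pick_max(THE_ONE_KEY, combined)


result = run([["Paul 90", "Ben 20"], ["Cook 80", "Joe 85"]])
print(result)  # ('the-one-key', ('Paul', 90))
```

Reusing one function for the combiner and the reducer works here because taking a maximum is associative and commutative, so applying it to partial results gives the same answer as applying it once to everything.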
A final remark: MapReduce may be "too much" for this if you have a small data set. A simple scan in local memory would be faster if you only have a few hundred or a few thousand values.
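For comparison, a local scan could look like the following sketch. It assumes each line has the form "name marks" with integer marks; `parse` and `local_max` are hypothetical helper names.

```python
from typing import Iterable, Tuple


def parse(line: str) -> Tuple[str, int]:
    """Split a line such as 'Paul 90' into (name, marks)."""
    name, marks = line.split()
    return name, int(marks)


def local_max(lines: Iterable[str]) -> Tuple[str, int]:
    """Scan all lines once in memory and return the record with the top marks."""
    return max((parse(line) for line in lines), key=lambda record: record[1])


print(local_max(["Paul 90", "Ben 20", "Cook 80", "Joe 85"]))  # ('Paul', 90)
```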
Upvotes: 1
Reputation: 1088
Take a look at the following code at gist:
https://gist.github.com/meshekhar/6dd773abf2af6ff631054facab885bf3
In the mapper, the data gets mapped to key-value pairs:
key: "Paul 90"
key: "Ben 20"
key: "Cook 80"
key: "Joe 85"
In the reducer, a while loop iterates over all the records; each value is split into name and marks, and the maximum marks seen so far are stored in a temporary variable.
At the end, the maximum value and the corresponding name are returned as a pair, e.g. Paul 90.
I tested it on a single-node system with more than 1 million records; it takes less than 10 seconds.
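The reducer loop described above might look roughly like this in plain Python. This is not the code from the gist, only a sketch of the same idea; `reduce_max` is a hypothetical name, and the records are assumed to end with an integer mark after the last space.

```python
from typing import Iterable, Optional, Tuple


def reduce_max(values: Iterable[str]) -> Optional[Tuple[str, int]]:
    """Iterate over records like 'Paul 90' and return the (name, marks) maximum."""
    max_name = None
    max_marks = None  # temporary variable holding the best marks seen so far
    for value in values:
        # Split on the last run of whitespace so names with spaces still work.
        name, marks_text = value.strip().rsplit(None, 1)
        marks = int(marks_text)
        if max_marks is None or marks > max_marks:
            max_name, max_marks = name, marks
    if max_marks is None:
        return None  # no records were received
    return max_name, max_marks


print(reduce_max(["Paul 90", "Ben 20", "Cook 80", "Joe 85"]))  # ('Paul', 90)
```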
Upvotes: 0
Reputation: 1810
A good way of doing this is to do a secondary sort in Hadoop. Your map output key should be a composite of (Name, Marks).
You would then implement a custom sort comparator that compares two such keys on Marks only and orders them from highest to lowest marks.
Typically a grouping comparator groups keys that share a natural key, but in this case we want all the keys to go into a single reducer, so the grouping comparator ignores the key differences and treats all keys as equal.
In the reducer, just take the first value and exit.
Details of secondary sort: Secondary Sort
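The idea can be illustrated with a small simulation in plain Python. It does not use Hadoop's comparator classes; `sort_comparator`, `grouping_key` and `reduce_first` are hypothetical stand-ins for the sort comparator, the grouping comparator and the reducer.

```python
from functools import cmp_to_key
from itertools import groupby


def sort_comparator(a, b):
    """Order composite keys (name, marks) by marks only, highest first."""
    return b[1] - a[1]


def grouping_key(composite_key):
    """Ignore key differences so every record falls into one reduce group."""
    return 0


def reduce_first(group):
    """Reducer: the group arrives sorted, so its first key holds the maximum."""
    return next(iter(group))


keys = [("Paul", 90), ("Ben", 20), ("Cook", 80), ("Joe", 85)]
# Stand-in for the shuffle phase: sort all map output keys with the comparator.
shuffled = sorted(keys, key=cmp_to_key(sort_comparator))
results = [reduce_first(group) for _, group in groupby(shuffled, key=grouping_key)]
print(results)  # [('Paul', 90)]
```

The sort does the real work here: once the keys are ordered by descending marks and grouped together, the reducer never has to scan the remaining values.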
Upvotes: 1