Naman Bhargava

Reputation: 59

How to find one specific key-value pair as output from the reducer

I need to find the student with the maximum marks using MapReduce (MR). The input is:

Paul 90
Ben 20
Cook 80
Joe 85

So the output of the reducer should be (Paul 90).

Can anyone help me with this?

Upvotes: 0

Views: 1394

Answers (3)

Ghislain Fourny

Reputation: 7279

You can map all input tuples to the same key, with each value being the input tuple itself, for example (the-one-key, (Ben, 20)). Because there is only one key, a single reduce call sees every tuple, and the reduce function returns only the tuple with the maximum grade.

To make sure that MR parallelism kicks in, use a combiner with the same function as the reducer above. This is safe because taking a maximum is associative and commutative, so the maximum of the per-mapper maxima is the overall maximum. That way, the reducer receives only one tuple from each mapper and has less work to do.

Edit: even better, you can eliminate all but the maximum inside the mapping function itself for best performance (see Venkat's remark that combiners are not guaranteed to be used).
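The in-mapper elimination mentioned in the edit can be sketched in plain Python, without Hadoop. This is only an illustration under assumptions: `map_split` and `ONE_KEY` are hypothetical names, and each input split is assumed to be a list of "name marks" lines. The mapper keeps a running maximum and emits a single pair once its split is exhausted (in Hadoop's Java API, that final emit would typically live in the mapper's cleanup step).

```python
from typing import Iterable, Iterator, Optional, Tuple

ONE_KEY = "the-one-key"  # hypothetical constant key shared by all mappers


def map_split(lines: Iterable[str]) -> Iterator[Tuple[str, Tuple[str, int]]]:
    """Scan one input split and emit only its best (name, marks) record."""
    best: Optional[Tuple[str, int]] = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, marks = line.split()
        record = (name, int(marks))
        # Keep the record only if it beats the running maximum.
        if best is None or record[1] > best[1]:
            best = record
    # Emit once, after the whole split has been read.
    if best is not None:
        yield (ONE_KEY, best)


# Two splits, as if handled by two separate mappers.
split_1 = ["Paul 90", "Ben 20"]
split_2 = ["Cook 80", "Joe 85"]
mapper_outputs = list(map_split(split_1)) + list(map_split(split_2))
print(mapper_outputs)
```

With this approach the reducer receives one tuple per mapper even if no combiner runs.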

Example with two mappers:

Paul 90
Ben 20
Cook 80
Joe 85

Mapped to:

Mapper 1
(the-one-key, (Paul, 90))
(the-one-key, (Ben, 20))

Mapper 2
(the-one-key, (Cook, 80))
(the-one-key, (Joe, 85))

Combined to (still on the mappers' side):

Mapper 1
(the-one-key, (Paul, 90))

Mapper 2
(the-one-key, (Joe, 85))

Reduced to:

(the-one-key, (Paul, 90))
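The map, combine, and reduce steps traced above can be simulated locally in plain Python. This is a sketch rather than Hadoop code: `map_record`, `reduce_max`, `run_mapper`, and `run_job` are hypothetical names, and the "shuffle" is just a dictionary that groups values by key. The same `reduce_max` function serves as both combiner and reducer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int]  # (name, marks)
ONE_KEY = "the-one-key"   # hypothetical constant key


def map_record(line: str) -> Tuple[str, Record]:
    """Map one 'name marks' line to (the-one-key, (name, marks))."""
    name, marks = line.split()
    return (ONE_KEY, (name, int(marks)))


def reduce_max(key: str, values: Iterable[Record]) -> Tuple[str, Record]:
    """Keep only the record with the highest marks for this key."""
    return (key, max(values, key=lambda record: record[1]))


def run_mapper(lines: List[str]) -> List[Tuple[str, Record]]:
    """Map one split, then apply reduce_max as a combiner on the mapper side."""
    grouped: Dict[str, List[Record]] = defaultdict(list)
    for line in lines:
        key, value = map_record(line)
        grouped[key].append(value)
    return [reduce_max(key, values) for key, values in grouped.items()]


def run_job(splits: List[List[str]]) -> List[Tuple[str, Record]]:
    """Simulate the shuffle by grouping combiner outputs by key, then reduce."""
    shuffled: Dict[str, List[Record]] = defaultdict(list)
    for split in splits:
        for key, value in run_mapper(split):
            shuffled[key].append(value)
    return [reduce_max(key, values) for key, values in shuffled.items()]


result = run_job([["Paul 90", "Ben 20"], ["Cook 80", "Joe 85"]])
print(result)
```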

A final remark: MapReduce may be "too much" for this if you have a small data set. A simple scan in local memory would be faster if you only have a few hundred or a few thousand values.

Upvotes: 1

ravi

Reputation: 1088

Take a look at the following code at gist:

https://gist.github.com/meshekhar/6dd773abf2af6ff631054facab885bf3

In the mapper, the data gets mapped to key-value pairs:

key: "Paul 90"
key: "Ben 20"
key: "Cook 80"
key: "Joe 85"

In the reducer, a while loop iterates through all the records; each value is split into name and marks, and the maximum marks seen so far are stored in a temporary variable.

At the end, the maximum value and the corresponding name are returned as a pair, e.g. Paul 90.

I tested it on a single-node system with more than 1 million records; it takes less than 10 seconds.
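For readers who cannot open the gist, the reducer loop described above might look roughly like the following plain-Python sketch. It is not the gist's code: `reduce_max_marks` is a hypothetical name, and the values are assumed to arrive as "name marks" strings from an iterator.

```python
from typing import Iterator, Optional, Tuple


def reduce_max_marks(values: Iterator[str]) -> Optional[Tuple[str, int]]:
    """Walk all values with a while loop and track the highest marks seen."""
    max_name: Optional[str] = None
    max_marks: Optional[int] = None  # the temporary variable holding the maximum
    while True:
        value = next(values, None)
        if value is None:
            break  # iterator exhausted
        # Split each value into name and marks.
        name, marks_text = value.split()
        marks = int(marks_text)
        if max_marks is None or marks > max_marks:
            max_name, max_marks = name, marks
    if max_name is None or max_marks is None:
        return None  # no input records
    return (max_name, max_marks)


result = reduce_max_marks(iter(["Paul 90", "Ben 20", "Cook 80", "Joe 85"]))
print(result)
```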

Upvotes: 0

Venkat

Reputation: 1810

A good way of doing this is to use a secondary sort in Hadoop. Your map output key should be a composite of (Name, Marks).

You would then implement a custom sort comparator that takes two such keys, compares them on Marks only, and orders the higher marks first.

Typically a grouping comparator groups keys that share a natural key, but in this case we want all the keys to go into a single reduce call. So the grouping comparator ignores the differences between keys and treats them all as equal. (This also assumes the job runs with a single reducer, or with a partitioner that sends every key to the same one.)

In the reducer, just take the first value and exit.
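The secondary-sort idea can be illustrated with a small local simulation in Python. This is a sketch under assumptions, not Hadoop code: `sort_comparator`, `grouping_key`, and `reduce_first` are hypothetical stand-ins for the sort comparator, the grouping comparator, and the reducer, and a single reducer is assumed so that all keys end up in one sorted stream.

```python
from functools import cmp_to_key
from itertools import groupby
from typing import Iterable, List, Tuple

CompositeKey = Tuple[str, int]  # (name, marks)


def sort_comparator(a: CompositeKey, b: CompositeKey) -> int:
    """Compare on marks only; a negative result puts higher marks first."""
    return b[1] - a[1]


def grouping_key(key: CompositeKey) -> int:
    """Ignore all key differences so every key falls into one group."""
    return 0


def reduce_first(keys: Iterable[CompositeKey]) -> CompositeKey:
    """Take the first key of the group (the highest marks) and stop."""
    return next(iter(keys))


map_output: List[CompositeKey] = [("Paul", 90), ("Ben", 20), ("Cook", 80), ("Joe", 85)]

# Sort phase: order the composite keys by marks, descending.
sorted_keys = sorted(map_output, key=cmp_to_key(sort_comparator))

# Grouping phase: one group for all keys, so the reducer runs once.
results = [reduce_first(group) for _, group in groupby(sorted_keys, key=grouping_key)]
print(results)
```

The design choice here is that the sort does the heavy lifting: the reducer never compares anything, it only reads the head of an already ordered group.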

Details of secondary sort: Secondary Sort

Upvotes: 1
