Srini Subramanian

Reputation: 161

Global values in Hadoop MapReduce

My use case involves finding defective items. Say I have a product list of millions of items in HDFS, each marked good or defective. I want to find the first 10 defective items and then stop.

I was thinking of using counters to do this, but it looks like counters are all at the task tracker level, so every task tracker maintains its own copy of the counter, which is not aggregated until the job completes. A counter in a map task running on one of the splits therefore has no way of knowing whether another map task has already found the 10 items.

Any idea on how to resolve this?

Upvotes: 0

Views: 637

Answers (2)

Praveen Sripati

Reputation: 33555

Find the local top 10 records in the map tasks and send them to the reducer. So, if there are 7 mappers, then the reducer will be getting 70 records. The reducer has to sort those 70 records and emit the global top 10 records. Here is the code for the same.
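A minimal sketch of this pattern, written as a plain-Python simulation rather than against the Hadoop API, might look like the following. The names `map_split` and `reduce_all`, the `LIMIT` constant, and the sample data are assumptions for illustration only. "First 10 defective items" stands in for whatever ranking the real job needs.

```python
from itertools import islice

LIMIT = 10  # number of defective items wanted (assumed from the question)


def map_split(records, limit=LIMIT):
    """Simulate one map task: emit at most `limit` defective items from its split."""
    defective = (item for item, status in records if status == "defective")
    return list(islice(defective, limit))


def reduce_all(mapper_outputs, limit=LIMIT):
    """Simulate the single reducer: keep the first `limit` items across all mappers."""
    merged = (item for output in mapper_outputs for item in output)
    return list(islice(merged, limit))


# Hypothetical input: 7 splits of 100 records, every third item marked defective.
splits = [
    [(f"item-{s}-{i}", "defective" if i % 3 == 0 else "good") for i in range(100)]
    for s in range(7)
]

# Each mapper works independently and never sees the other mappers' results.
mapper_outputs = [map_split(split) for split in splits]

# The reducer receives at most 7 * 10 = 70 records and trims them to 10.
result = reduce_all(mapper_outputs)
```

Each simulated mapper stops collecting once it has 10 local matches, so the reducer never sees more than 10 records per split, however large the split is.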

Note that this approach works only with a single reducer, not with more than one, and this might be a bottleneck. Also, there is no communication between the mappers, so there is no way to reduce the burden on the reducer. Check this paper, where the mappers can talk to each other using global data. IBM BigInsights implements it.

Check this blog entry for more patterns.

Upvotes: 1

Pai

Reputation: 85

Assuming you are using Hadoop, counters are also available globally.

However, I do not understand the reason for using MapReduce for this problem.

Upvotes: 0
