Jiho Choi

Reputation: 1321

MapReduce sorting with heap

I am trying to analyze social network data that contains follower and followee pairs. I want to find the top 10 users with the most followees using MapReduce.

I have already produced pairs of userID and number_of_followees with one MapReduce step.

With this data, however, I am not sure how to sort it in a distributed system.

I am also not sure how a priority queue can be used in either the mappers or the reducers, since each of them only holds part of the data.

Can someone explain how I can use data structures to sort this massive data set?

Thank you very much.

Upvotes: 0

Views: 219

Answers (2)

Gyanendra Dwivedi

Reputation: 5557

To sort the data in descending order, you need another MapReduce job. The mapper would emit the followee count as the key and the user ID (Twitter handle) as the value.

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortingMap extends Mapper<LongWritable, Text, LongWritable, Text> {
    private final Text outValue = new Text();
    private final LongWritable outKey = new LongWritable(0);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Assuming that the input data is "TweeterId <number of followees>" separated by a tab
        String[] tokens = line.split(Pattern.quote("\t"));
        if (tokens.length > 1) {
            outKey.set(Long.parseLong(tokens[1]));  // followee count becomes the key
            outValue.set(tokens[0]);                // user ID becomes the value
            context.write(outKey, outValue);
        }
    }
}

For the reducer, use an identity reducer: with the org.apache.hadoop.mapreduce API the default Reducer<K,V,K,V> already passes every record through unchanged (the old mapred API calls this IdentityReducer).

// SortedComparator Class

public class DescendingOrderKeyComparator extends WritableComparator {

    public DescendingOrderKeyComparator() {
        super(LongWritable.class, true); // register the key type so instances can be created
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        return -1 * w1.compareTo(w2);    // invert the natural (ascending) key order
    }
}

In the driver class, set the sort comparator:

job.setSortComparatorClass(DescendingOrderKeyComparator.class);
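Putting it together, here is a minimal driver sketch, assuming the SortingMap and DescendingOrderKeyComparator classes above and the counting job's output as input; the class name SortDriver and the path arguments are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sort by followee count");
        job.setJarByClass(SortDriver.class);

        job.setMapperClass(SortingMap.class);
        // No reducer class set: the default Reducer is the identity,
        // so records come out in the (descending) sorted key order.
        job.setNumReduceTasks(1);   // one reducer gives a single, globally sorted output file

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setSortComparatorClass(DescendingOrderKeyComparator.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // output of the counting job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // sorted output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a single reducer the whole result is one sorted file, so the first 10 lines are your top 10 users.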

Upvotes: 1

AdamSkywalker

Reputation: 11619

If you have a big input file (or files) in the format user_id = number_of_followers, a simple MapReduce algorithm to find the top N users is:

  1. each mapper processes its own input split, finds the top N users in it (using a bounded heap or similar structure), and writes them to a single reducer (see the sketch after this list)
  2. the single reducer receives number_of_mappers * N rows and finds the top N users among them
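Here is a minimal sketch of that pattern, assuming tab-separated user_id<TAB>number_of_followees input and N = 10; the class names TopN, TopNMapper and TopNReducer are just illustrative, and the TreeMap (used here as a bounded min-heap in place of an explicit priority queue) lets ties on the count overwrite each other for simplicity:

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TopN {
    public static final int N = 10;

    // Each mapper keeps only its local top N in memory.
    public static class TopNMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final TreeMap<Long, String> topN = new TreeMap<>();

        @Override
        public void map(LongWritable key, Text value, Context context) {
            String[] tokens = value.toString().split("\t");
            if (tokens.length > 1) {
                topN.put(Long.parseLong(tokens[1]), tokens[0]);
                if (topN.size() > N) {
                    topN.remove(topN.firstKey());   // drop the smallest count
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit the local top N under a single (null) key so they all meet in one reducer.
            for (Map.Entry<Long, String> e : topN.entrySet()) {
                context.write(NullWritable.get(), new Text(e.getValue() + "\t" + e.getKey()));
            }
        }
    }

    // The single reducer sees at most number_of_mappers * N rows and repeats the same trick.
    public static class TopNReducer extends Reducer<NullWritable, Text, Text, LongWritable> {
        @Override
        public void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            TreeMap<Long, String> topN = new TreeMap<>();
            for (Text v : values) {
                String[] tokens = v.toString().split("\t");
                topN.put(Long.parseLong(tokens[1]), tokens[0]);
                if (topN.size() > N) {
                    topN.remove(topN.firstKey());
                }
            }
            // Write the winners in descending order of followee count.
            for (Map.Entry<Long, String> e : topN.descendingMap().entrySet()) {
                context.write(new Text(e.getValue()), new LongWritable(e.getKey()));
            }
        }
    }
}

In the driver, call job.setNumReduceTasks(1) so every mapper's local top list ends up in the same reducer; a java.util.PriorityQueue works just as well as the TreeMap if you prefer an explicit heap.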

Upvotes: 1
