Jiho Choi

Reputation: 1321

MapReduce sorting with heap

I am trying to analyze social network data that contains follower and followee pairs. I want to find the top 10 users with the most followees using MapReduce.

I have already produced pairs of userID and number_of_followees with one MapReduce step.

With this data, however, I am not sure how to sort it in a distributed system.

I am also not sure how a priority queue can be used in either the mappers or the reducers, since each of them only holds part of the data.

Can someone explain how I can use data structures to sort this massive data set?

Thank you very much.

Upvotes: 0

Views: 219

Answers (2)

Gyanendra Dwivedi

Reputation: 5557

To sort the data in descending order, you need another MapReduce job. The mapper would emit the followee count as the key and the user ID (Twitter handle) as the value.

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortingMap extends Mapper<LongWritable, Text, LongWritable, Text> {
    private final Text outValue = new Text();
    private final LongWritable outKey = new LongWritable(0);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Assuming that the input data is "TweeterId <number of followees>" separated by a tab
        String[] tokens = line.split(Pattern.quote("\t"));
        if (tokens.length > 1) {
            outKey.set(Long.parseLong(tokens[1]));  // followee count becomes the key
            outValue.set(tokens[0]);                // user ID becomes the value
            context.write(outKey, outValue);
        }
    }
}

For the reducer, use an identity reducer: with the org.apache.hadoop.mapreduce API the default Reducer<K,V,K,V> already passes every record through unchanged (the old mapred API calls this IdentityReducer).

// SortedComparator Class

public class DescendingOrderKeyComparator extends WritableComparator {

    public DescendingOrderKeyComparator() {
        super(LongWritable.class, true); // register the key type so instances can be created
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        return -1 * w1.compareTo(w2);    // invert the natural (ascending) key order
    }
}

In the driver class, set the sort comparator:

job.setSortComparatorClass(DescendingOrderKeyComparator.class);
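Putting it together, here is a minimal driver sketch, assuming the SortingMap and DescendingOrderKeyComparator classes above and the counting job's output as input; the class name SortDriver and the path arguments are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sort by followee count");
        job.setJarByClass(SortDriver.class);

        job.setMapperClass(SortingMap.class);
        // No reducer class set: the default Reducer is the identity,
        // so records come out in the (descending) sorted key order.
        job.setNumReduceTasks(1);   // one reducer gives a single, globally sorted output file

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setSortComparatorClass(DescendingOrderKeyComparator.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // output of the counting job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // sorted output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a single reducer the whole result is one sorted file, so the first 10 lines are your top 10 users.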

Upvotes: 1

AdamSkywalker

Reputation: 11619

If you have a big input file (or files) in the format user_id = number_of_followers, a simple MapReduce algorithm to find the top N users is:

  1. each mapper processes its own input split, finds the top N users in it (using a bounded heap or similar structure), and writes them to a single reducer (see the sketch after this list)
  2. the single reducer receives number_of_mappers * N rows and finds the top N users among them
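Here is a minimal sketch of that pattern, assuming tab-separated user_id<TAB>number_of_followees input and N = 10; the class names TopN, TopNMapper and TopNReducer are just illustrative, and the TreeMap (used here as a bounded min-heap in place of an explicit priority queue) lets ties on the count overwrite each other for simplicity:

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TopN {
    public static final int N = 10;

    // Each mapper keeps only its local top N in memory.
    public static class TopNMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final TreeMap<Long, String> topN = new TreeMap<>();

        @Override
        public void map(LongWritable key, Text value, Context context) {
            String[] tokens = value.toString().split("\t");
            if (tokens.length > 1) {
                topN.put(Long.parseLong(tokens[1]), tokens[0]);
                if (topN.size() > N) {
                    topN.remove(topN.firstKey());   // drop the smallest count
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit the local top N under a single (null) key so they all meet in one reducer.
            for (Map.Entry<Long, String> e : topN.entrySet()) {
                context.write(NullWritable.get(), new Text(e.getValue() + "\t" + e.getKey()));
            }
        }
    }

    // The single reducer sees at most number_of_mappers * N rows and repeats the same trick.
    public static class TopNReducer extends Reducer<NullWritable, Text, Text, LongWritable> {
        @Override
        public void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            TreeMap<Long, String> topN = new TreeMap<>();
            for (Text v : values) {
                String[] tokens = v.toString().split("\t");
                topN.put(Long.parseLong(tokens[1]), tokens[0]);
                if (topN.size() > N) {
                    topN.remove(topN.firstKey());
                }
            }
            // Write the winners in descending order of followee count.
            for (Map.Entry<Long, String> e : topN.descendingMap().entrySet()) {
                context.write(new Text(e.getValue()), new LongWritable(e.getKey()));
            }
        }
    }
}

In the driver, call job.setNumReduceTasks(1) so every mapper's local top list ends up in the same reducer; a java.util.PriorityQueue works just as well as the TreeMap if you prefer an explicit heap.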

Upvotes: 1
