Reputation: 1321
I am trying to analyze the social network data which contains follower
and followee
pairs. I want to find the top 10 users who have the most followees using MapReduce.
I made pairs of userID
and number_of_followee
with one MapReduce step.
With this data, however, I am not sure how to sort them in distributed systems.
I am not sure how priority queue
can be used in either of Mappers and Reducers since they have the distributed data.
Can someone explain me how I can use data structures to sort the massive data?
Thank you very much.
Upvotes: 0
Views: 219
Reputation: 5557
To Sort the data in descending order, you need another mapreduce
job. The Mapper would emit "number of followers" as key and twitter handle as value.
class SortingMap extends Map<LongWritable, Text, LongWritable, Text> {
private Text value = new Text();
private LongWritable key = new LongWritable(0);
@Overwrite
public void map(LongWritable key, Text value, Context context) throws IOException {
String line = value.toString();
// Assuming that the input data is "TweeterId <number of follower>" separated by tab
String tokens[] = value.split(Pattern.quote("\t"));
if(tokens.length > 1) {
key.set(Long.parseLong(tokens[1]));
value.set(tokens[0]);
context.write(key, value);
}
}
}
For reducer, use IdentityReducer<K,V>
// SortedComparator Class
public class DescendingOrderKeyComparator extends WritableComparator {
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
return -1 * w1.compareTo(w2);
}
}
In the Driver Class, set SortedComparator
job.setSortComparatorClass(DescendingOrderKeyComparator.class);
Upvotes: 1
Reputation: 11619
If you have big input file (files) of format user_id = number_of_followers
, simple map-reduce algorithm to find top N
users is:
number_of_mappers * N
rows and finds top N users among themUpvotes: 1