Reputation: 2457
I'm pretty new to MapReduce. Currently trying to complete the udacity course on Hadoop MapReduce.
I have a mapper that parses a forum nodes, and I will get the tags associated with each node. My objective is to sort the top 10 tags.
An example output:
video 1
cs101 1
meta 1
bug 1
issues 1
nationalities 1
cs101 1
welcome 1
cs101 1
cs212 1
cs262 1
cs253 1
discussion 1
meta 1
It is pretty easy to add them all up in reducer:
#!/usr/bin/python
import sys
import string
total = 0
oldKey = None
for line in sys.stdin:
data_mapped = line.strip().split("\t")
if(len(data_mapped) != 2):
print "====================="
print line.strip()
print "====================="
continue
key, value = data_mapped
if oldKey and oldKey != key:
print total, "\t", oldKey
oldKey = key;
total = 0
oldKey = key
total += float(value)
if oldKey != None:
print total, "\t", oldKey
Output:
1.0 application
1.0 board
1.0 browsers
1.0 bug
8.0 cs101
1.0 cs212
5.0 cs253
1.0 cs262
1.0 deadlines
1.0 digital
5.0 discussion
1.0 google-appengine
2.0 homework
1.0 html
1.0 hungarian
1.0 hw2-1
3.0 issues
2.0 jobs
2.0 lessons
I know that the keys are sorted in the output of a mapper, hence I just test if the keys are still the same tag. If not, then I'll output the # of times that a tag appears.
However, the problem is how do I sort this list?
I can sort the list in python if I store all the information in a list or a dictionary. However, it seems like a bad design decision, because if you have a lot of different tags, you will go out of memory.
Upvotes: 3
Views: 8957
Reputation: 19050
I believe you can use the collections.Counter class here:
Example: (modified from your code)
#!/usr/bin/python
import sys
import collections
counter = collections.Counter()
for line in sys.stdin:
k, v = line.strip().split("\t", 2)
counter[k] += int(v)
print counter.most_common(10)
The collections.Counter()
class implements this exact use-case and many other common use-cases around counting things and collecting various stats, etc.
8.3.1. Counter objects A counter tool is provided to support convenient and rapid tallies. For example:
Upvotes: 3