Reputation: 913
I am trying to write a MapReduce program (this is the map part) that returns bigrams, i.e. adjacent word pairs, from text read on stdin.
This is my concept, in half-pseudocode:
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for pair in words: #HERE***
        print '%s\t%s' % (pair,1)
How can I extract each adjacent pair of words so that I can output all of them in the form "word1 word2, 1" and then combine them in my reducer? I'd like to keep the format as close to this as possible.
Thank you.
Upvotes: 0
Views: 156
Reputation: 82450
You can pair them like so:
from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)
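Plugged into your mapper, it could look roughly like the sketch below. This assumes Python 3 (so `print` is a function, unlike in your snippet), and `map_bigrams` is just a name I made up for illustration. I've used a small sample list in place of `sys.stdin` so it runs on its own.

```python
from itertools import tee


def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one element
    return zip(a, b)


def map_bigrams(lines):
    # Hypothetical helper: yields one tab-separated record per adjacent
    # word pair, with the pair as the key and 1 as the count.
    for line in lines:
        words = line.strip().split()
        for first, second in pairwise(words):
            yield '%s %s\t%s' % (first, second, 1)


# In the actual mapper you would pass sys.stdin instead of a sample list.
for record in map_bigrams(["the quick brown fox"]):
    print(record)
```

Note that pairs are formed within each line only, so a line with fewer than two words emits nothing and no pair spans a line break. On Python 3.10 or later, `itertools.pairwise` does the same job without the helper.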
Upvotes: 2