How to run Python mapreduce in Hadoop Streaming

Question

I'm trying to run a MapReduce program on Apache Hadoop that computes the average of the prime numbers in a given input file. This is my mapper:

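import sys
# Identity mapper: pass every non-empty input line through unchanged.
for word in sys.stdin:
    word = word.strip()  # drop the trailing newline so print() does not double it
    if word:
        print(word)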

And this is the reducer:

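import sys

primes = []
for word in sys.stdin:
    word = word.strip()
    if not word:
        continue  # skip blank lines instead of crashing in int()
    n = int(word)
    if n >= 2:
        is_prime = True
        for a in range(2, n):
            if n % a == 0:
                is_prime = False
                break  # a divisor was found, no need to keep testing
        if is_prime:
            primes.append(n)

# Guard against division by zero when the input contains no primes.
if primes:
    print(sum(primes) / float(len(primes)))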
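
A quick way to check both scripts without a cluster is to imitate what Hadoop Streaming does: feed the input lines to the mapper, sort the mapper's output, and feed the sorted lines to the reducer. The sketch below does this in a single process. The names `is_prime`, `run_mapper`, `run_reducer`, and `simulate` are hypothetical helpers, not part of the scripts above, and the sample input is made up for illustration.

```python
def is_prime(n):
    """Trial division, the same idea as in the reducer above."""
    if n < 2:
        return False
    for a in range(2, n):
        if n % a == 0:
            return False
    return True


def run_mapper(lines):
    """Identity mapper: pass each non-empty line through, stripped."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def run_reducer(lines):
    """Average the primes among the incoming lines; None if there are none."""
    primes = [int(x) for x in lines if is_prime(int(x))]
    if not primes:
        return None
    return sum(primes) / float(len(primes))


def simulate(lines):
    """Mimic the streaming flow: map, sort (the shuffle step), then reduce."""
    mapped = sorted(run_mapper(lines))
    return run_reducer(mapped)


if __name__ == "__main__":
    sample = ["2\n", "3\n", "4\n", "\n", "7\n"]  # made-up input, one number per line
    print(simulate(sample))
```

If this gives the expected average for a small sample, the scripts themselves are probably fine and the problem is more likely in how the job is launched.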

Now, when I run it with the following command:

python primesMapper.py primesReducer.py -r hadoop --hadoop-streaming-jar /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.10.0.jar hdfs://bigdata1.sis.uta.fi:/user/students/input/primes1.txt --output-dir group25/primes.txt

I get no error, but nothing happens: the command just hangs. When I terminate it manually, I can see it is stuck in the mapper file:

  File "primesMapper.py", line 8, in <module>
    for line in sys.stdin:

Any help?
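
For context: `-r hadoop`, `--hadoop-streaming-jar`, and `--output-dir` are options of the mrjob library, while the scripts above are plain stdin/stdout scripts. Started as `python primesMapper.py ...`, the mapper most likely ignores the extra arguments and simply waits for input on the terminal, which would match the traceback. Plain scripts like these are usually submitted through the streaming jar itself with `hadoop jar`. The sketch below only builds and prints such a command and runs nothing. `build_streaming_command` is a hypothetical helper, the paths are copied unchanged from the question, and it is an assumption that `python` is available on the cluster nodes.

```python
import shlex


def build_streaming_command(jar, mapper, reducer, input_path, output_path):
    """Return the argv list for a Hadoop Streaming job (nothing is executed)."""
    return [
        "hadoop", "jar", jar,
        # Generic option, must come before the streaming options:
        # ships both scripts to the worker nodes.
        "-files", "{},{}".format(mapper, reducer),
        "-mapper", "python {}".format(mapper),
        "-reducer", "python {}".format(reducer),
        "-input", input_path,
        "-output", output_path,
    ]


if __name__ == "__main__":
    cmd = build_streaming_command(
        jar="/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/"
            "hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.10.0.jar",
        mapper="primesMapper.py",
        reducer="primesReducer.py",
        input_path="hdfs://bigdata1.sis.uta.fi:/user/students/input/primes1.txt",
        output_path="group25/primes.txt",
    )
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Note that the output path names a directory that Hadoop creates, and the job fails if it already exists.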
