Reputation: 3612
I have two small Python scripts:
CountWordOccurence_mapper.py
#!/usr/bin/env python
import sys
#print(sys.argv[1])
text = sys.argv[1]
wordCount = text.count(sys.argv[2])
#print (sys.argv[2],wordCount)
print '%s\t%s' % (sys.argv[2], wordCount)
PrintWordCount_reducer.py
#!/usr/bin/env python
import sys
finalCount = 0
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t')
    count = int(count)
    finalCount += count
print(word, finalCount)
I executed them as follows:
$ ./CountWordOccurence_mapper.py \
"I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long." \
"Accord" \
| /home/hadoopranch/omkar/PrintWordCount_reducer.py
('Accord', 4)
As seen, my objective is trivial: count the number of occurrences of the supplied word (in this case, "Accord") in the given text.
Now, I intend to do the same using Hadoop streaming. The text file on HDFS (partial content) is:
"message" : "I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long."
"message" : "I am an angry Honda owner! In 2009 I bought a new Honda Civic and have taken great care of it. Yesterday I tried to start it unsuccessfully. After hours at the auto mechanics it was found that there was a glitch in the electric/computer system. The news was disappointing enough (and expensive) but to find out the problem is basically a defect/common problem with the year/make/model I purchased is awful. When I bought a NEW Honda I thought I bought quality. I was wrong! Will Honda step up?"
I modified CountWordOccurence_mapper.py:
#!/usr/bin/env python
import sys
for text in sys.stdin:
    wordCount = text.count(sys.argv[1])
    print '%s\t%s' % (sys.argv[1], wordCount)
My first point of confusion was how to pass the word to be counted (e.g. "Accord", "Honda") as an argument to the mapper; the -cmdenv name=value option only confused me further. I still went ahead and executed the following command:
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
-input /random_data/honda_service_within_warranty.txt \
-output /random_op/cnt.txt \
-file /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord" \
-file /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py \
-reducer /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py
As expected, the job failed and I got the following error:
Traceback (most recent call last):
File "/tmp/hadoop-hduser/mapred/local/taskTracker/hduser/jobcache/job_201304232210_0007/attempt_201304232210_0007_m_000001_3/work/./CountWordOccurence_mapper.py", line 6, in <module>
wordCount = text.count(sys.argv[1])
IndexError: list index out of range
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Please correct the syntactic and basic mistakes that I have made.
Thanks and regards!
Upvotes: 1
Views: 1319
Reputation: 3612
Actually, I had tried -cmdenv wordToBeCounted="Accord", but the issue was with my Python file: I had forgotten to change it to read the value "Accord" from the environment variable (and NOT from the argument array). I'm attaching the code for CountWordOccurence_mapper.py, in case anyone would like to use it for reference:
#!/usr/bin/env python
import sys
import os
wordToBeCounted = os.environ['wordToBeCounted']
for text in sys.stdin:
    wordCount = text.count(wordToBeCounted)
    print '%s\t%s' % (wordToBeCounted, wordCount)
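For reference, the full streaming command then looks roughly like this (a sketch using the same jar and paths as in my question; the word is now supplied via -cmdenv instead of being appended after -mapper):
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
-cmdenv wordToBeCounted="Accord" \
-input /random_data/honda_service_within_warranty.txt \
-output /random_op/cnt.txt \
-file /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
-file /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py \
-reducer /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py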
Thanks and regards!
Upvotes: 0
Reputation: 30089
I think your problem lies in the following portion of your command-line invocation:
-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord"
I think your assumption here is that the string "Accord" is being passed as the first argument to the mapper. I'm pretty sure this isn't the case; in fact, the string "Accord" is most probably being ignored by the streaming driver entry point class (StreamJob.java).
To fix this you'll want to go back to using the -cmdenv parameter and then extract this key/value pair in your Python code (I'm not a Python programmer, but I'm sure a quick Google search will point you towards the snippet you need).
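For example, a minimal sketch of such a mapper (reusing the wordToBeCounted environment variable name from the question's follow-up; treat it as a starting point, not tested code):
#!/usr/bin/env python
import os
import sys

# Value set on the command line via: -cmdenv wordToBeCounted="Accord"
wordToBeCounted = os.environ['wordToBeCounted']

for text in sys.stdin:
    # Emit "word<TAB>count" for each input line; the reducer sums the counts
    print '%s\t%s' % (wordToBeCounted, text.count(wordToBeCounted))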
Upvotes: 1