Reputation: 3612
I have two small Python scripts:
CountWordOccurence_mapper.py
#!/usr/bin/env python
import sys
#print(sys.argv[1])
text = sys.argv[1]
wordCount = text.count(sys.argv[2])
#print (sys.argv[2],wordCount)
print '%s\t%s' % (sys.argv[2], wordCount)
PrintWordCount_reducer.py
#!/usr/bin/env python
import sys
finalCount = 0
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t')
    count = int(count)
    finalCount += count
print(word, finalCount)
I executed them as follows:
$ ./CountWordOccurence_mapper.py \
"I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long." \
"Accord" \
| /home/hadoopranch/omkar/PrintWordCount_reducer.py
('Accord', 4)
As seen, my objective is trivial: count the number of occurrences of the supplied word (in this case, "Accord") in the given text.
Now, I intend to do the same using Hadoop streaming. The text file on HDFS (partial content) is:
"message" : "I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long."
"message" : "I am an angry Honda owner! In 2009 I bought a new Honda Civic and have taken great care of it. Yesterday I tried to start it unsuccessfully. After hours at the auto mechanics it was found that there was a glitch in the electric/computer system. The news was disappointing enough (and expensive) but to find out the problem is basically a defect/common problem with the year/make/model I purchased is awful. When I bought a NEW Honda I thought I bought quality. I was wrong! Will Honda step up?"
I modified CountWordOccurence_mapper.py:
#!/usr/bin/env python
import sys
for text in sys.stdin:
    wordCount = text.count(sys.argv[1])
    print '%s\t%s' % (sys.argv[1], wordCount)
My first point of confusion was how to pass the word to be counted (e.g. "Accord", "Honda") as an argument to the mapper; the -cmdenv name=value option only confused me further. I still went ahead and executed the following command:
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
-input /random_data/honda_service_within_warranty.txt \
-output /random_op/cnt.txt \
-file /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord" \
-file /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py \
-reducer /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py
As expected, the job failed and I got the following error:
Traceback (most recent call last):
File "/tmp/hadoop-hduser/mapred/local/taskTracker/hduser/jobcache/job_201304232210_0007/attempt_201304232210_0007_m_000001_3/work/./CountWordOccurence_mapper.py", line 6, in <module>
wordCount = text.count(sys.argv[1])
IndexError: list index out of range
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Please correct the syntactic and basic mistakes that I have made.
Thanks and regards!
Upvotes: 1
Views: 1319
Reputation: 3612
Actually, I had tried -cmdenv wordToBeCounted="Accord", but the issue was with my Python file: I had forgotten to change it to read the value "Accord" from the environment variable (and NOT from the argument array). I'm attaching the code for CountWordOccurence_mapper.py, in case anyone would like to use it for reference:
#!/usr/bin/env python
import sys
import os
wordToBeCounted = os.environ['wordToBeCounted']
for text in sys.stdin:
    wordCount = text.count(wordToBeCounted)
    print '%s\t%s' % (wordToBeCounted, wordCount)
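For reference, the full streaming command then looks roughly like this (a sketch using the same jar and paths as in my question; the word is now supplied via -cmdenv instead of being appended after -mapper):
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
-cmdenv wordToBeCounted="Accord" \
-input /random_data/honda_service_within_warranty.txt \
-output /random_op/cnt.txt \
-file /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
-file /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py \
-reducer /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py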
Thanks and regards!
Upvotes: 0
Reputation: 30089
I think your problem lies in the following portion of your command-line invocation:
-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord"
I think your assumption here is that the string "Accord" is being passed as the first argument to the mapper. I'm pretty sure this isn't the case; in fact, the string "Accord" is most probably being ignored by the streaming driver entry point class (StreamJob.java).
To fix this you'll want to go back to using the -cmdenv parameter and then extract this key/value pair in your Python code (I'm not a Python programmer, but I'm sure a quick Google search will point you towards the snippet you need).
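For example, a minimal sketch of such a mapper (reusing the wordToBeCounted environment variable name from the question's follow-up; treat it as a starting point, not tested code):
#!/usr/bin/env python
import os
import sys

# Value set on the command line via: -cmdenv wordToBeCounted="Accord"
wordToBeCounted = os.environ['wordToBeCounted']

for text in sys.stdin:
    # Emit "word<TAB>count" for each input line; the reducer sums the counts
    print '%s\t%s' % (wordToBeCounted, text.count(wordToBeCounted))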
Upvotes: 1