Reputation: 4045
I have a Hadoop Streaming job. The job uses a Python script which imports another Python script. The command works fine from the command line but fails when run through Hadoop Streaming. Here is my Hadoop Streaming command:
hadoop jar $streamingJar \
-D mapreduce.map.memory.mb=4096 \
-files preprocess.py,parse.py \
-input $input \
-output $output \
-mapper "python parse.py" \
-reducer NONE
And here is the first line of parse.py:
from preprocess import normalize_large_text, normalize_small_text
When I run the command through Hadoop Streaming, I see the following traceback in the logs:
Traceback (most recent call last):
File "preprocess.py", line 1, in <module>
from preprocess import normalize_large_text, normalize_small_text, normalize_skill_cluster
ImportError: No module named preprocess
My understanding is that Hadoop puts all the files shipped with -files into the same working directory as the task. If that's true, then I don't see how the import could fail. Does anyone know what's going on?
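One way to check that assumption (a minimal debug sketch, not part of the original job) is to have the mapper log the contents of its working directory to stderr, so the Hadoop task logs show whether preprocess.py actually lands next to parse.py:

import os
import sys

# Minimal sketch: write the task's working directory contents to stderr so
# the Hadoop task logs show whether preprocess.py was shipped with -files.
sys.stderr.write("working dir contents: %s\n" % os.listdir("."))

# Pass input through unchanged so the job still behaves like a mapper.
for line in sys.stdin:
    sys.stdout.write(line)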
Thanks
Upvotes: 1
Views: 2031
Reputation: 2316
You need to put both scripts in the same directory and ship that directory with the -files flag.
hadoop jar $streamingJar \
-D mapreduce.map.memory.mb=4096 \
-files python_files \
-input $input \
-output $output \
-mapper "python python_files/parse.py" \
-reducer NONE
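If you'd rather keep the two scripts as separate -files entries, another hedged sketch (not the approach above, just an alternative) is to have parse.py put its own directory on the module search path before the import, so preprocess.py is found regardless of where the streaming task starts Python:

import os
import sys

# Sketch: prepend the directory containing parse.py to sys.path so that
# preprocess.py, shipped into the same task directory via -files, is importable.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from preprocess import normalize_large_text, normalize_small_text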
Upvotes: 3