Jon

Reputation: 4045

Hadoop streaming with multiple Python files

I have a Hadoop streaming job. This job uses a Python script which imports another Python script. The command works fine from the command line but fails under Hadoop streaming. Here is an example of my Hadoop streaming command:

hadoop jar $streamingJar \
    -D mapreduce.map.memory.mb=4096 \
    -files preprocess.py,parse.py \
    -input $input \
    -output $output \
    -mapper "python parse.py" \
    -reducer NONE

And here is the first line of parse.py:

from preprocess import normalize_large_text, normalize_small_text
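
(For reference, nothing about preprocess.py matters here beyond it being importable; a minimal, purely hypothetical stand-in would be:)

# preprocess.py - hypothetical stand-in; the real implementations are not shown here
def normalize_large_text(text):
    # placeholder body; the actual normalization logic is irrelevant to the import problem
    return text.strip().lower()

def normalize_small_text(text):
    return text.strip().lower()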

When I run the command through Hadoop streaming, I see the following output in the logs:

Traceback (most recent call last):
  File "preprocess.py", line 1, in <module>
    from preprocess import normalize_large_text, normalize_small_text, normalize_skill_cluster
ImportError: No module named preprocess

My understanding is that Hadoop puts all the files in the same directory. If that's true, I don't see how this could fail. Does anyone know what's going on?
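
One quick way to verify that assumption, given that stderr from streaming tasks normally ends up in the task logs: swap in a throwaway diagnostic mapper that lists the task's working directory (the name debug_mapper.py is made up for illustration).

#!/usr/bin/env python
# debug_mapper.py - throwaway diagnostic, not part of the real job.
# Writes the task's working directory contents to stderr, which shows up
# in the task logs, so you can see which files -files actually shipped.
import os
import sys

sys.stderr.write("cwd: %s\n" % os.getcwd())
for name in sorted(os.listdir(".")):
    sys.stderr.write("localized: %s\n" % name)

# Pass input through unchanged so the job still runs end to end.
for line in sys.stdin:
    sys.stdout.write(line)

Run it with -files debug_mapper.py -mapper "python debug_mapper.py" in place of the real mapper.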

Thanks

Upvotes: 1

Views: 2031

Answers (1)

bearrito

Reputation: 2316

You need to put the scripts in the same directory and ship that directory with the -files flag:

hadoop jar $streamingJar \
    -D mapreduce.map.memory.mb=4096 \
    -files python_files \
    -input $input \
    -output $output \
    -mapper "python python_files/parse.py" \
    -reducer NONE
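
This works because Python puts the directory of the script being run (python_files here) at the front of sys.path, so preprocess.py sitting next to parse.py becomes importable.

If you would rather keep the flat -files preprocess.py,parse.py layout from the question, a sketch of an alternative (assuming, as is standard for Hadoop streaming, that the files are symlinked into the task's working directory) is to have parse.py add its own directory to sys.path before the import:

# top of parse.py
import os
import sys

# Use abspath, not realpath: the symlink's directory is the task's working
# directory, which is where the preprocess.py symlink also lives.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from preprocess import normalize_large_text, normalize_small_text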

Upvotes: 3
