Jon

Reputation: 4045

Hadoop streaming with multiple Python files

I have a Hadoop streaming job. This job uses a Python script which imports another Python script. The command works fine from the command line but fails under Hadoop streaming. Here is an example of my Hadoop streaming command:

hadoop jar $streamingJar \
    -D mapreduce.map.memory.mb=4096 \
    -files preprocess.py,parse.py \
    -input $input \
    -output $output \
    -mapper "python parse.py" \
    -reducer NONE

And here is the first line of parse.py:

from preprocess import normalize_large_text, normalize_small_text
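
(For reference, nothing about preprocess.py matters here beyond it being importable; a minimal, purely hypothetical stand-in would be:)

# preprocess.py - hypothetical stand-in; the real implementations are not shown here
def normalize_large_text(text):
    # placeholder body; the actual normalization logic is irrelevant to the import problem
    return text.strip().lower()

def normalize_small_text(text):
    return text.strip().lower()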

When I run the command through Hadoop streaming, I see the following output in the logs:

Traceback (most recent call last):
  File "preprocess.py", line 1, in <module>
    from preprocess import normalize_large_text, normalize_small_text, normalize_skill_cluster
ImportError: No module named preprocess

My understanding is that Hadoop puts all the files in the same directory. If that's true, I don't see how this could fail. Does anyone know what's going on?
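
One quick way to verify that assumption, given that stderr from streaming tasks normally ends up in the task logs: swap in a throwaway diagnostic mapper that lists the task's working directory (the name debug_mapper.py is made up for illustration).

#!/usr/bin/env python
# debug_mapper.py - throwaway diagnostic, not part of the real job.
# Writes the task's working directory contents to stderr, which shows up
# in the task logs, so you can see which files -files actually shipped.
import os
import sys

sys.stderr.write("cwd: %s\n" % os.getcwd())
for name in sorted(os.listdir(".")):
    sys.stderr.write("localized: %s\n" % name)

# Pass input through unchanged so the job still runs end to end.
for line in sys.stdin:
    sys.stdout.write(line)

Run it with -files debug_mapper.py -mapper "python debug_mapper.py" in place of the real mapper.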

Thanks

Upvotes: 1

Views: 2031

Answers (1)

bearrito

Reputation: 2316

You need to put the scripts in the same directory and ship that directory with the -files flag:

hadoop jar $streamingJar \
    -D mapreduce.map.memory.mb=4096 \
    -files python_files \
    -input $input \
    -output $output \
    -mapper "python python_files/parse.py" \
    -reducer NONE
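
This works because Python puts the directory of the script being run (python_files here) at the front of sys.path, so preprocess.py sitting next to parse.py becomes importable.

If you would rather keep the flat -files preprocess.py,parse.py layout from the question, a sketch of an alternative (assuming, as is standard for Hadoop streaming, that the files are symlinked into the task's working directory) is to have parse.py add its own directory to sys.path before the import:

# top of parse.py
import os
import sys

# Use abspath, not realpath: the symlink's directory is the task's working
# directory, which is where the preprocess.py symlink also lives.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from preprocess import normalize_large_text, normalize_small_text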

Upvotes: 3
