Reputation: 63
This previous question addressed how to import modules such as nltk for use with Hadoop Streaming.
The steps outlined were:
zip -r nltkandyaml.zip nltk yaml
mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod
You can now import the nltk module for use in your Python script:
import zipimport
importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')
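For illustration, here is a minimal sketch of a streaming mapper built on the snippet above; the bigram counting, and the assumption that nltkandyaml.mod sits in the task's working directory (e.g. shipped with the streaming -file option), are hypothetical rather than part of the original steps:

#!/usr/bin/env python
# Hypothetical streaming mapper: 'nltkandyaml.mod' is the renamed zip from the
# steps above, shipped alongside this script (e.g. via the streaming -file option).
import sys
import zipimport

importer = zipimport.zipimporter('nltkandyaml.mod')
nltk = importer.load_module('nltk')

for line in sys.stdin:
    # Emit each word bigram with a count of 1; nltk.bigrams works on plain
    # token lists, so no corpus data has to be present on the node.
    tokens = line.strip().split()
    for bigram in nltk.bigrams(tokens):
        print('%s\t%d' % (' '.join(bigram), 1))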
I have a job that I want to run on Amazon EMR, and I'm not sure where to put the zipped files. Do I need to create a bootstrap script under the bootstrap options, or should I put the tar.gz files in S3 and reference them in the extra args? I'm pretty new to all this, so an answer that walks me through the process would be much appreciated.
Upvotes: 4
Views: 2305
Reputation: 1945
You have the following options:
Option 1: Create a bootstrap action script and place it on S3. The script should download the module in whatever format you prefer and put it where your mapper/reducer can access it. To find out exactly where to place the files, start the cluster in such a way that it does not shut down after completion, SSH into it, and examine the directory structure. (A rough sketch of such a script is shown after the link below.)
Option 2: Use mrjob to launch your job flows. When starting a job with mrjob, you can specify bootstrap_python_packages, and mrjob will install the packages automatically by uncompressing each .tar.gz and running setup.py install (a minimal example is shown at the end of this answer).
http://packages.python.org/mrjob/configs-runners.html
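As a rough sketch of option 1 (the bucket name, key, and target directory below are placeholders, and it assumes hadoop fs can read from S3 on the node, which EMR normally configures), the bootstrap action could be as simple as:

#!/bin/bash
# Hypothetical bootstrap action: copy the zipped modules from S3 onto the node
# so the streaming mapper/reducer can reach them. Bucket, key, and destination
# are placeholders -- adjust them after inspecting the cluster's layout as
# described above.
set -e
hadoop fs -copyToLocal s3://mybucket/bootstrap/nltkandyaml.mod /home/hadoop/nltkandyaml.mod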
I would prefer option 2, because mrjob also helps a lot in developing MapReduce jobs in Python. In particular, it lets you run jobs locally (with or without Hadoop) as well as on EMR, which simplifies debugging.
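To make option 2 concrete, here is a minimal hypothetical mrjob job (the file and class names are made up); the nltk and PyYAML tarballs would be listed under bootstrap_python_packages in the emr section of mrjob.conf, as described at the link above:

# word_stem_count.py -- hypothetical mrjob job; with nltk listed under
# bootstrap_python_packages it can be imported like any installed package,
# both locally and on the EMR nodes.
from mrjob.job import MRJob
import nltk

# PorterStemmer needs no downloaded corpora, so it works out of the box.
STEMMER = nltk.PorterStemmer()

class MRStemCount(MRJob):

    def mapper(self, _, line):
        # Emit (stem, 1) for every word in the input line.
        for word in line.split():
            yield STEMMER.stem(word.lower()), 1

    def reducer(self, stem, counts):
        yield stem, sum(counts)

if __name__ == '__main__':
    MRStemCount.run()

Running python word_stem_count.py input.txt executes the job locally, while adding -r emr sends the same job to EMR (assuming your AWS credentials are configured in mrjob.conf).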
Upvotes: 2