user1403483

Some basic questions about running MapReduce programs using mrjob on Amazon EMR

I am new to mrjob and I am having trouble getting a job to run on Amazon EMR. I will list my questions in order.

  1. I can run an mrjob job on my local machine. However, when I have a config file at /home/ankit/.mrjob.conf and at /etc/mrjob.conf, the job does not run on my local machine. Here is the output I get: https://s3-ap-southeast-1.amazonaws.com/imagna.sample/local.txt
  2. What is MRJOB_CONF in "the location specified by MRJOB_CONF" in the documentation?
  3. What is 'base_tmp_directory' used for? Also, do I need to upload the input data to S3 before starting the job, or will it be loaded from my local computer when the job starts?
  4. Do I need to do any bootstrapping if I use libraries such as numpy or scikit? If yes, how?
  5. This is the output I get when I run the command to start a job on EMR: https://s3-ap-southeast-1.amazonaws.com/imagna.sample/emr.txt

Any solutions?

Thanks a lot.

Upvotes: 0

Views: 305

Answers (1)

John Wiseman

Reputation: 3137

  1. Your URL is not accessible (I get an "Access Denied" error).
  2. mrjob.conf is a configuration file, and MRJOB_CONF is an environment variable you can set to the path of that file. The file can live in several locations; see http://pythonhosted.org/mrjob/configs-conf.html
  3. You can use input data from your local machine just by specifying the paths to the input files on the command line; mrjob will upload the data to S3 for you. If you specify an s3://... URL instead, mrjob will use the data at that S3 path.
  4. To use non-standard packages, see http://pythonhosted.org/mrjob/writing-and-running.html#custom-python-packages
  5. Your URL is not accessible (I get an "Access Denied" error).
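
To make point 2 concrete, here is a minimal sketch of the config lookup order as I understand it from the docs linked above: the path in the MRJOB_CONF environment variable first, then ~/.mrjob.conf, then /etc/mrjob.conf. This is an illustration using only the standard library, not mrjob's own code; the function names candidate_conf_paths and find_conf are made up for this example.

```python
import os


def candidate_conf_paths(environ=None):
    """Return candidate config paths in the assumed lookup order.

    Assumption (from the mrjob docs, not from mrjob's source):
    MRJOB_CONF, then ~/.mrjob.conf, then /etc/mrjob.conf.
    """
    environ = os.environ if environ is None else environ
    paths = []
    # An explicitly set MRJOB_CONF is checked before the default locations.
    if environ.get("MRJOB_CONF"):
        paths.append(environ["MRJOB_CONF"])
    paths.append(os.path.expanduser("~/.mrjob.conf"))
    paths.append("/etc/mrjob.conf")
    return paths


def find_conf(environ=None, exists=os.path.exists):
    """Return the first candidate path that exists, or None."""
    for path in candidate_conf_paths(environ):
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    # Show which config file would be picked up on this machine, if any.
    print("Candidates:", candidate_conf_paths())
    print("Selected:", find_conf())
```

If this order holds for your mrjob version, a file at ~/.mrjob.conf would be picked up before the one in /etc/mrjob.conf, so when both exist it is worth checking which one actually contains your EMR settings.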

Upvotes: 1
