How to pass arguments to streaming job on Amazon EMR

I want to produce the output of my map function, filtering the data by dates.

In local tests, I simply call the application, passing the dates as parameters:

cat access_log | ./mapper.py 20/12/2014 31/12/2014 | ./reducer.py

The parameters are then read in the map function:

#!/usr/bin/python
import sys

# Start and end dates arrive as command-line arguments (DD/MM/YYYY)
date1 = sys.argv[1]
date2 = sys.argv[2]

The question is: how do I pass the date parameters to the mapper call on Amazon EMR?

I am a beginner in MapReduce. I will appreciate any help.

Upvotes: 1

Answers (1)

ohad edelstain

Reputation: 1645

First of all, when you run a local test (and you should, as often as possible), the correct format, in order to reproduce how map-reduce works, is:

cat access_log | ./mapper.py 20/12/2014 31/12/2014 | sort | ./reducer.py | sort

That's the way the Hadoop framework works: it sorts the mapper output by key before handing it to the reducer.
If you are working with a big file, you should do it in steps so that you can verify the result of each stage, meaning:

cat access_log | ./mapper.py 20/12/2014 31/12/2014 > map_result.txt
cat map_result.txt | sort > map_result_sorted.txt
cat map_result_sorted.txt | ./reducer.py > reduce_result.txt
cat reduce_result.txt | sort > map_reduce_result.txt
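The sort between the mapper and the reducer matters because a streaming reducer usually assumes that all lines for one key arrive next to each other. The original reducer.py is not shown, so the following is only a minimal sketch, assuming the mapper emits tab-separated "key, count" lines; reduce_counts and the sample data are made up for illustration.

```python
from itertools import groupby


def reduce_counts(lines):
    """Sum counts per key, assuming equal keys arrive on consecutive lines."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(int(value) for _, value in group)


# Hypothetical mapper output: one "date<TAB>1" line per matching log line.
mapper_output = ["2014-12-21\t1\n", "2014-12-20\t1\n", "2014-12-21\t1\n"]

# Without the sort, the two 2014-12-21 lines are not adjacent and are
# reported as two separate groups.
unsorted_result = list(reduce_counts(mapper_output))

# With the sort (what "| sort" and the Hadoop shuffle provide),
# each date gets exactly one total.
sorted_result = list(reduce_counts(sorted(mapper_output)))

print(unsorted_result)
print(sorted_result)
```

Skipping the sort in a local test can therefore hide or create bugs that would not appear on the cluster.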

In regard to your main question:
It's the same thing.
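On the script side nothing changes: the mapper still reads the two dates from sys.argv, whether it is started from a shell pipe or by Hadoop streaming. Here is a rough sketch of such a mapper. It assumes Apache-style timestamps like [25/Dec/2014:10:15:32 +0000] in access_log and emits "date, 1" pairs; both the log format and the output format are guesses, since the question shows neither, and the function names are illustrative.

```python
#!/usr/bin/python
import re
import sys
from datetime import datetime

# Assumption: Apache-style timestamps such as [25/Dec/2014:10:15:32 +0000].
# Note that %b (month abbreviation) below depends on an English/C locale.
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}):")


def parse_arg_date(text):
    """Parse a command-line date given as DD/MM/YYYY."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def line_date(line):
    """Return the date found in a log line, or None if there is none."""
    match = TIMESTAMP.search(line)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%d/%b/%Y").date()
    except ValueError:
        return None


def filter_lines(lines, start, end):
    """Yield 'date<TAB>1' for every line dated within [start, end]."""
    for line in lines:
        day = line_date(line)
        if day is not None and start <= day <= end:
            yield "%s\t1" % day.isoformat()


# Runs only when both dates are supplied, e.g.
#   cat access_log | ./mapper.py 20/12/2014 31/12/2014
if __name__ == "__main__" and len(sys.argv) == 3:
    start = parse_arg_date(sys.argv[1])
    end = parse_arg_date(sys.argv[2])
    for record in filter_lines(sys.stdin, start, end):
        sys.stdout.write(record + "\n")
```

Because the dates come in as plain arguments, the same file can be used unchanged locally and on EMR.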

If you are going to use the Amazon web console to create your cluster, in the Add step window you just write the following:

Name: learning amazon emr
Mapper: (here they ask for the S3 path to your mapper; ignore that and just write the script name and parameters, no backslash...) mapper.py 20/12/2014 31/12/2014
Reducer: (same as for the mapper) reducer.py (you can add parameters here too)
Input location: ...
Output location: ... (just remember to use a new output location every time, or your step will fail)
Arguments: -files s3://cod/mapper.py,s3://cod/reducer.py (use your own file paths here; use the -files argument even if you add only one file)

That's it
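If you would rather script the step than fill in the console form, the same fields map onto a streaming step definition. The sketch below uses boto3's add_job_flow_steps and assumes an EMR release (4.x or later) where streaming steps run through command-runner.jar with hadoop-streaming as the first argument. The helper names build_streaming_step and submit_step are made up for illustration, and the input and output URIs are placeholders, not paths from the original post.

```python
def build_streaming_step(name, mapper, reducer, files, input_uri, output_uri):
    """Build an EMR step dict equivalent to the console's streaming form."""
    args = [
        "hadoop-streaming",
        "-files", ",".join(files),
        "-mapper", mapper,      # script name plus its parameters, as one string
        "-reducer", reducer,
        "-input", input_uri,
        "-output", output_uri,  # must not exist yet, or the step fails
    ]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


def submit_step(cluster_id, step):
    """Add the step to a running cluster and return the new step IDs."""
    import boto3  # imported lazily so building a step needs no AWS setup

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]


step = build_streaming_step(
    name="learning amazon emr",
    mapper="mapper.py 20/12/2014 31/12/2014",
    reducer="reducer.py",
    files=["s3://cod/mapper.py", "s3://cod/reducer.py"],
    input_uri="s3://example-bucket/input/",        # hypothetical placeholder
    output_uri="s3://example-bucket/output-001/",  # hypothetical placeholder
)
```

Calling submit_step with your own cluster ID and this step would then queue it on the cluster; the mapper string carries the dates exactly as in the console's Mapper field.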

If you want to go deeper into argument passing, I suggest you see this guy's blog on how to pass arguments so that you need only a single map/reduce file.

Hope it helps.

Upvotes: 1
