John Knight

Reputation: 3253

Error loading a file into the EMR distributed cache using elastic-mapreduce

I'm using the command below to launch a cluster.

./elastic-mapreduce --create \
 --stream \
 --cache  s3n://bucket_name/code/totalInstallUsers#totalInstallUsers \
 --input s3n://bucket_name/input \
 --output s3n://bucket_name/output \
 --mapper s3n://bucket_name/code/mapper.py \
 --reducer s3n://bucket_name \
 --jobflow-role EMR_EC2_DefaultRole \
 --service-role EMR_DefaultRole \
 --debug \
 --log-uri s3n://bucket_name/logs

and I always get the error message below. If I remove the --cache option, the cluster launches successfully.

Error: undefined method `each' for #<String:0x00000002c28ba0>
/home/ubuntu/data_processing/commands.rb:806:in `steps'
/home/ubuntu/data_processing/commands.rb:1232:in `block in enact'
/home/ubuntu/data_processing/commands.rb:1232:in `map'
/home/ubuntu/data_processing/commands.rb:1232:in `enact'
/home/ubuntu/data_processing/commands.rb:49:in `block in enact'
/home/ubuntu/data_processing/commands.rb:49:in `each'
/home/ubuntu/data_processing/commands.rb:49:in `enact'
/home/ubuntu/data_processing/commands.rb:2422:in `create_and_execute_commands'
/home/ubuntu/data_processing/elastic-mapreduce-cli.rb:13:in `'
/usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
/usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
./elastic-mapreduce:6:in `'

The reason for using --cache is that I want mapper.py to be able to open the data file with "with open('./totalInstallUsers', 'r') as infile:".
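
For context, here is roughly how I expect mapper.py to use the cached file (a minimal sketch, not my actual code; the tab-separated stdin format and the filtering logic are just illustrative):

#!/usr/bin/env python
# Minimal sketch of mapper.py (illustrative only). The file distributed via
# --cache ...#totalInstallUsers should appear in the task's working directory
# under the name after the '#', so it can be opened with a relative path.
import sys

# Load the cached lookup data once, before streaming stdin.
install_users = set()
with open('./totalInstallUsers', 'r') as infile:
    for line in infile:
        install_users.add(line.strip())

# Regular Hadoop streaming input arrives on stdin, one record per line.
for line in sys.stdin:
    key = line.rstrip('\n').split('\t', 1)[0]
    # Illustrative logic: pass through only records whose key is in the cached set.
    if key in install_users:
        sys.stdout.write(line)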

Could anyone give me a clue? Thanks.

Upvotes: 1

Views: 293

Answers (1)

John Knight

Reputation: 3253

Posting the solution I ended up with, in case it helps others. Using the AWS CLI (aws emr), the command looks like:

aws emr create-cluster \
    --name "cluster-name" \
    --enable-debugging \
    --log-uri s3://bucket-name/logs \
    --ami-version 3.7.0 \
    --use-default-roles \
    --ec2-attributes KeyName=your-key \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --auto-terminate \
    --steps file://./streaming.json

And streaming.json looks like this:
[
    {
        "Type": "STREAMING",
        "Name": "Streaming program",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "Args": [
            "-files", "s3://bucket-name/code/mapper.py,s3://bucket-name/code/reducer.py",
            "-mapper", "mapper.py",
            "-reducer", "reducer.py",
            "-input", "s3://bucket-name/input",
            "-output", "s3://bucket-name/output",
            "-cacheFile", "s3://bucket-name/code/data-file-name#new-file-name"
        ]
    }
]
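
For reference, here is a minimal reducer.py sketch that fits this streaming step (assuming the mapper emits tab-separated key/value lines; the per-key counting is just illustrative, not the original code):

#!/usr/bin/env python
# Minimal reducer.py sketch (illustrative only). Hadoop streaming delivers the
# mapper output sorted by key on stdin, one tab-separated record per line.
import sys

current_key = None
count = 0

for line in sys.stdin:
    key = line.rstrip('\n').split('\t', 1)[0]
    if key != current_key:
        if current_key is not None:
            # Emit the aggregate for the previous key.
            print('%s\t%d' % (current_key, count))
        current_key = key
        count = 0
    count += 1  # illustrative aggregation: count records per key

if current_key is not None:
    print('%s\t%d' % (current_key, count))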

Upvotes: 1
