Reputation: 4995
I'm following this example here hoping to successfully run something using EC2/S3/EMR/R. https://gist.github.com/406824
The job fails on the Streaming Step. Here are the error logs:
controller:
2011-07-21T19:14:27.711Z INFO Fetching jar file.
2011-07-21T19:14:30.380Z INFO Working dir /mnt/var/lib/hadoop/steps/1
2011-07-21T19:14:30.380Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java -cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-0.20-core.jar:/home/hadoop/hadoop-0.20-tools.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/1 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/1/tmp -Djava.library.path=/home/hadoop/lib/native/Linux-i386-32 org.apache.hadoop.util.RunJar /home/hadoop/contrib/streaming/hadoop-streaming.jar -cacheFile s3n://emrexample21/calculatePiFunction.R#calculatePiFunction.R -input s3n://emrexample21/numberList.txt -output s3n://emrout/ -mapper s3n://emrexample21/mapper.R -reducer s3n://emrexample21/reducer.R
2011-07-21T19:16:12.057Z INFO Execution ended with ret val 1
2011-07-21T19:16:12.057Z WARN Step failed with bad retval
2011-07-21T19:16:14.185Z INFO Step created jobs: job_201107211913_0001
stderr:
Streaming Command Failed!
stdout:
packageJobJar: [/mnt/var/lib/hadoop/tmp/hadoop-unjar2368654264051498521/] [] /mnt/var/lib/hadoop/steps/2/tmp/streamjob1658200878131882888.jar tmpDir=null
syslog:
2011-07-21 19:50:29,539 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
2011-07-21 19:50:29,539 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 15
2011-07-21 19:50:31,988 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2011-07-21 19:50:31,999 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2334756312e0012cac793f12f4151bdaa1b4b1bb]
2011-07-21 19:50:33,040 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input paths to process : 1
2011-07-21 19:50:35,375 INFO org.apache.hadoop.streaming.StreamJob (main): getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
2011-07-21 19:50:35,375 INFO org.apache.hadoop.streaming.StreamJob (main): Running job: job_201107211948_0001
2011-07-21 19:50:35,375 INFO org.apache.hadoop.streaming.StreamJob (main): To kill this job, run:
2011-07-21 19:50:35,375 INFO org.apache.hadoop.streaming.StreamJob (main): UNDEF/bin/hadoop job -Dmapred.job.tracker=ip-10-203-50-161.ec2.internal:9001 -kill job_201107211948_0001
2011-07-21 19:50:35,376 INFO org.apache.hadoop.streaming.StreamJob (main): Tracking URL: http://ip-10-203-50-161.ec2.internal:9100/jobdetails.jsp?jobid=job_201107211948_0001
2011-07-21 19:50:36,566 INFO org.apache.hadoop.streaming.StreamJob (main): map 0% reduce 0%
2011-07-21 19:50:57,778 INFO org.apache.hadoop.streaming.StreamJob (main): map 50% reduce 0%
2011-07-21 19:51:09,839 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 0%
2011-07-21 19:51:12,852 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 1%
2011-07-21 19:51:15,864 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 3%
2011-07-21 19:51:18,875 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 0%
2011-07-21 19:52:12,454 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 100%
2011-07-21 19:52:12,455 INFO org.apache.hadoop.streaming.StreamJob (main): To kill this job, run:
2011-07-21 19:52:12,455 INFO org.apache.hadoop.streaming.StreamJob (main): UNDEF/bin/hadoop job -Dmapred.job.tracker=ip-10-203-50-161.ec2.internal:9001 -kill job_201107211948_0001
2011-07-21 19:52:12,456 INFO org.apache.hadoop.streaming.StreamJob (main): Tracking URL: http://ip-10-203-50-161.ec2.internal:9100/jobdetails.jsp?jobid=job_201107211948_0001
2011-07-21 19:52:12,456 ERROR org.apache.hadoop.streaming.StreamJob (main): Job not Successful!
2011-07-21 19:52:12,456 INFO org.apache.hadoop.streaming.StreamJob (main): killJob...
Upvotes: 2
Views: 2856
Reputation: 60756
I'm the author of the code you are trying to run. It was written as a proof of concept for running R on EMR, and it's very hard to build genuinely useful code with that method: submitting R code to EMR with all the manual steps the method requires is an exercise in tedious pain.
To get around the tedium, I later wrote the Segue package, which abstracts away all of the loading of bits into S3 as well as the updating of the R version on the Hadoop nodes. Jeffrey Breen wrote a blog post about using Segue. Take a look at that and see if it's easier to use.
edit:
I should at least give a few tips on debugging R code in EMR/Hadoop streaming:
1) Debugging R code from the Hadoop logs is damn near impossible. In my experience, I really have to set up an EMR cluster, log into it, and run the code manually from within R. That requires starting the cluster with an EC2 key pair defined so you can SSH in. I generally do this debugging on a single-node cluster with a very small data set; there's no sense spinning up multiple nodes just to debug.
2) Running the job interactively within R on the EMR node requires having any input files in the /home/hadoop/ directory on the Hadoop node. The easiest way to do that is to scp any files you need up to the cluster.
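For step 2, the copy is a one-liner; a sketch, where the key file name and the master node's address are placeholders you'd replace with your own values from the EMR console:

```shell
# Hypothetical values: substitute your own key pair file and the
# master node's public DNS name
KEY=mykey.pem
MASTER=hadoop@ec2-xx-xx-xx.compute-1.amazonaws.com

# Stage the scripts and input data in /home/hadoop/ on the master node
scp -i "$KEY" mapper.R reducer.R numberList.txt "$MASTER":/home/hadoop/
```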
3) Before doing 1 and 2, test your code locally using the same method.
4) Once you think the R code works, you should be able to do this on your Hadoop machine:
cat numberList.txt | ./mapper.R | sort | ./reducer.R
and it should run. If you are not using a mapper or a reducer, the missing stage can be replaced with cat. I use numberList.txt in this example because that is the input file name in my code on GitHub.
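The cat substitution above can be sanity-checked end to end without R at all; a minimal sketch with a made-up three-line input file, using cat in place of both mapper.R and reducer.R:

```shell
# Stand-in for the real input: three unsorted numbers, one per line
printf '3\n1\n2\n' > numberList.txt

# Same pipeline shape as the streaming job, with cat as both mapper
# and reducer; the identity pipeline just emits the input lines in
# sorted order
cat numberList.txt | cat | sort | cat
```

Once that plumbing works, swap the real scripts back in. For `./mapper.R` to run directly like this, the scripts need a `#!/usr/bin/env Rscript` shebang line and execute permission (`chmod +x mapper.R reducer.R`).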
Upvotes: 6