Reputation: 563
I am trying to copy files from S3 to HDFS using a job flow in EMR. When I run the command below, the job flow starts successfully but gives me an error when it tries to copy the file to HDFS. Do I need to set any input file permissions?
Command:
./elastic-mapreduce --jobflow j-35D6JOYEDCELA --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://odsh/input/,--dest,hdfs:///Users
Output:
Task TASKID="task_201301310606_0001_r_000000" TASK_TYPE="REDUCE" TASK_STATUS="FAILED" FINISH_TIME="1359612576612" ERROR="java.lang.RuntimeException: Reducer task failed to copy 1 files: s3://odsh/input/GL_01112_20121019.dat etc
    at com.amazon.external.elasticmapreduce.s3distcp.CopyFilesReducer.close(CopyFilesReducer.java:70)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:538)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:429)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Upvotes: 7
Views: 9162
Reputation: 91
The problem is that the map-reduce job fails: the mappers execute perfectly, but the reducers hit a bottleneck in the cluster's memory.
This solved it for me: -Dmapreduce.job.reduces=30. If it still fails, try reducing it further to 20, i.e. -Dmapreduce.job.reduces=20.
I'll add the full set of arguments for ease of understanding:
In AWS Cluster:
JAR location : command-runner.jar
Main class : None
Arguments : s3-dist-cp -Dmapreduce.job.reduces=30 --src=hdfs:///user/ec2-user/riskmodel-output --dest=s3://dev-quant-risk-model/2019_03_30_SOM_EZ_23Factors_Constrained_CSR_Stats/output --multipartUploadChunkSize=1000
Action on failure: Continue
In a script file:
aws --profile $AWS_PROFILE emr add-steps --cluster-id $CLUSTER_ID --steps Type=CUSTOM_JAR,Jar='command-runner.jar',Name="Copy Model Output To S3",ActionOnFailure=CONTINUE,Args=[s3-dist-cp,-Dmapreduce.job.reduces=20,--src=$OUTPUT_BUCKET,--dest=$S3_OUTPUT_LARGEBUCKET,--multipartUploadChunkSize=1000]
Upvotes: 2
Reputation: 1569
Adjusting the number of workers didn't work for me; s3distcp always failed on a small/medium instance. Increasing the heap size of the task JVM (via -D mapred.child.java.opts=-Xmx1024m) solved it for me.
Example usage:
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  -D mapred.child.java.opts=-Xmx1024m \
  --src s3://source/ \
  --dest hdfs:///dest/ \
  --targetSize 128 \
  --groupBy '.*\.([0-9]+-[0-9]+-[0-9]+)-[0-9]+\..*' \
  --outputCodec gzip
Upvotes: 2
Reputation: 31
I've seen this same problem, caused by a race condition. Passing -Ds3DistCp.copyfiles.mapper.numWorkers=1 helps avoid it.
I hope Amazon fixes this bug.
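For reference, a sketch of how the option could be passed when invoking s3distcp directly with hadoop jar (the jar path follows another answer here, and the src/dest are taken from the question, so adjust them for your setup):
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  -Ds3DistCp.copyfiles.mapper.numWorkers=1 \
  --src s3://odsh/input/ \
  --dest hdfs:///Users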
Upvotes: 3
Reputation: 305
I'm getting the same exception. It looks like the bug is caused by a race condition when CopyFilesReducer uses multiple CopyFilesRunable instances to download the files from S3. The problem is that all the threads use the same temp directory, and each thread deletes the temp directory when it is done. Hence, when one thread completes before another, it deletes the temp directory that the other thread is still using.
I've reported the problem to AWS, but in the meantime you can work around the bug by forcing the reducer to use a single thread: set the variable s3DistCp.copyfiles.mapper.numWorkers to 1 in your job config.
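As a sketch, assuming s3distcp accepts -D generic options in its argument list (the other answers here suggest it does), the command from the question might look like this with the property set:
./elastic-mapreduce --jobflow j-35D6JOYEDCELA \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '-Ds3DistCp.copyfiles.mapper.numWorkers=1,--src,s3://odsh/input/,--dest,hdfs:///Users'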
Upvotes: 7