raghuram gururajan

Reputation: 563

Copy files from Amazon S3 to HDFS using S3DistCp fails

I am trying to copy files from S3 to HDFS using a workflow in EMR. When I run the command below, the jobflow starts successfully but fails with an error when it tries to copy the files to HDFS. Do I need to set any input file permissions?

Command:

./elastic-mapreduce --jobflow j-35D6JOYEDCELA --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://odsh/input/,--dest,hdfs:///Users

Output

Task TASKID="task_201301310606_0001_r_000000" TASK_TYPE="REDUCE" TASK_STATUS="FAILED" FINISH_TIME="1359612576612" ERROR="java.lang.RuntimeException: Reducer task failed to copy 1 files: s3://odsh/input/GL_01112_20121019.dat etc
    at com.amazon.external.elasticmapreduce.s3distcp.CopyFilesReducer.close(CopyFilesReducer.java:70)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:538)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:429)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Upvotes: 7

Views: 9162

Answers (4)

avinash nahar

Reputation: 91

The problem is that the map-reduce job fails: the mappers execute perfectly, but the reducers hit a memory bottleneck on the cluster.

Passing -Dmapreduce.job.reduces=30 solved this for me. If it still fails, reduce it further to 20, i.e. -Dmapreduce.job.reduces=20.

I'll add the entire argument for ease of understanding:

In the AWS EMR console:

JAR location : command-runner.jar

Main class : None

Arguments : s3-dist-cp -Dmapreduce.job.reduces=30 --src=hdfs:///user/ec2-user/riskmodel-output --dest=s3://dev-quant-risk-model/2019_03_30_SOM_EZ_23Factors_Constrained_CSR_Stats/output --multipartUploadChunkSize=1000

Action on failure: Continue

In a script file:

aws --profile $AWS_PROFILE emr add-steps --cluster-id $CLUSTER_ID --steps Type=CUSTOM_JAR,Jar='command-runner.jar',Name="Copy Model Output To S3",ActionOnFailure=CONTINUE,Args=[s3-dist-cp,-Dmapreduce.job.reduces=20,--src=$OUTPUT_BUCKET,--dest=$S3_OUTPUT_LARGEBUCKET,--multipartUploadChunkSize=1000]
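Applied to the jobflow-style command from the question, the same property can presumably be passed ahead of the S3DistCp options. A sketch, assuming the old elastic-mapreduce client forwards a combined -Dkey=value token unchanged (the jobflow ID below is a placeholder):

./elastic-mapreduce --jobflow j-XXXXXXXXXXXX \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '-Dmapreduce.job.reduces=20,--src,s3://odsh/input/,--dest,hdfs:///Users'

On older Hadoop 1.x AMIs the equivalent property is mapred.reduce.tasks.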

Upvotes: 2

erikreed

Reputation: 1569

Adjusting the number of workers didn't work for me; s3distcp always failed on a small/medium instance. Increasing the heap size of the child task JVM (via -D mapred.child.java.opts=-Xmx1024m) solved it for me.

Example usage:

hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    -D mapred.child.java.opts=-Xmx1024m \
    --src s3://source/ \
    --dest hdfs:///dest/ \
    --targetSize 128 \
    --groupBy '.*\.([0-9]+-[0-9]+-[0-9]+)-[0-9]+\..*' \
    --outputCodec gzip

Upvotes: 2

user3833204

Reputation: 31

I ran into the same problem, caused by a race condition. Passing -Ds3DistCp.copyfiles.mapper.numWorkers=1 helps avoid it.
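A minimal sketch of passing that option when invoking the S3DistCp jar directly (the jar path follows the example in the answer above and may differ on your cluster; the src/dest paths are the ones from the question):

hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    -D s3DistCp.copyfiles.mapper.numWorkers=1 \
    --src s3://odsh/input/ \
    --dest hdfs:///Users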

I hope Amazon fixes this bug.

Upvotes: 3

user1995521

Reputation: 305

I'm getting the same exception. It looks like the bug is caused by a race condition when CopyFilesReducer uses multiple CopyFilesRunable instances to download the files from S3. The problem is that it uses the same temp directory in multiple threads, and the threads delete the temp directory when they're done. Hence, when one thread completes before another it deletes the temp directory that another thread is still using.

I've reported the problem to AWS, but in the meantime you can work around the bug by forcing the reducer to use a single thread: set the variable s3DistCp.copyfiles.mapper.numWorkers to 1 in your job config.
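For the jobflow-style command in the question, a sketch of setting that property on the command line, assuming the elastic-mapreduce client passes the -Dkey=value token through to S3DistCp unchanged (the jobflow ID is a placeholder):

./elastic-mapreduce --jobflow j-XXXXXXXXXXXX \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '-Ds3DistCp.copyfiles.mapper.numWorkers=1,--src,s3://odsh/input/,--dest,hdfs:///Users'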

Upvotes: 7
