Reputation: 371
I am trying to move data from HDFS to S3 using distcp. The distcp job seems to succeed, but the files are not being created correctly on S3: instead of the original file names and paths, the bucket ends up containing objects named block_<some number> at its root. I could not find any documentation/examples for this. What am I missing? How can I debug?
Here are some more details:
$ hadoop version
Hadoop 0.20.2-cdh3u0
Subversion -r
Compiled by diego on Sun May 1 15:42:11 PDT 2011
From source with checksum
hadoop fs -ls hdfs://hadoopmaster/data/paramesh/
…<bunch of files>…
hadoop distcp hdfs://hadoopmaster/data/paramesh/ s3://<id>:<key>@paramesh-test/
$ ./s3cmd-1.1.0-beta3/s3cmd ls s3://paramesh-test
DIR s3://paramesh-test//
DIR s3://paramesh-test/test/
2012-05-10 02:20 0 s3://paramesh-test/block_-1067032400066050484
2012-05-10 02:20 8953 s3://paramesh-test/block_-183772151151054731
2012-05-10 02:20 11209 s3://paramesh-test/block_-2049242382445148749
2012-05-10 01:40 1916 s3://paramesh-test/block_-5404926129840434651
2012-05-10 01:40 8953 s3://paramesh-test/block_-6515202635859543492
2012-05-10 02:20 48051 s3://paramesh-test/block_1132982570595970987
2012-05-10 01:40 48052 s3://paramesh-test/block_3632190765594848890
2012-05-10 02:20 1160 s3://paramesh-test/block_363439138801598558
2012-05-10 01:40 1160 s3://paramesh-test/block_3786390805575657892
2012-05-10 01:40 11876 s3://paramesh-test/block_4393980661686993969
Upvotes: 9
Views: 20969
Reputation: 13430
Updating this for Apache Hadoop 2.7+, and ignoring Amazon EMR as they've changed things there.
Improving distcp reliability and performance when working with object stores is still an ongoing piece of work... contributions are, as always, welcome.
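For context, a minimal sketch of what a copy through the s3a connector looks like on Hadoop 2.7+ (the credential properties are the standard fs.s3a keys; the paths and bucket reuse the ones from the question and are illustrative, not part of the original answer):
hadoop distcp -Dfs.s3a.access.key=<access-key> -Dfs.s3a.secret.key=<secret-key> hdfs://hadoopmaster/data/paramesh/ s3a://paramesh-test/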
Upvotes: 2
Reputation: 2475
In the event that your files in HDFS are larger than 5GB, you will encounter errors in your distcp job that look like:
Caused by: org.jets3t.service.S3ServiceException: S3 Error Message. -- ResponseCode: 400, ResponseStatus: Bad Request, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>23472570134</ProposedSize><MaxSizeAllowed>5368709120</MaxSizeAllowed><RequestId>5BDA6B12B9E99CE9</RequestId><HostId>vmWvS3Ynp35bpIi7IjB7mv1waJSBu5gfrqF9U2JzUYsXg0L7/sy42liEO4m0+lh8V6CqU7PU2uo=</HostId></Error>
        at org.jets3t.service.S3Service.putObject(S3Service.java:2267)
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:122)
        ... 27 more
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
To fix this, either use the s3n filesystem as @matthew-rathbone suggested, but with multipart uploads enabled via -Dfs.s3n.multipart.uploads.enabled=true, like:
hadoop distcp -Dfs.s3n.multipart.uploads.enabled=true hdfs://file/1 s3n://bucket/destination
OR
Use the "next generation" S3 filesystem, s3a, like:
hadoop distcp -Dfs.s3a.endpoint=s3.us-east-1.amazonaws.com hdfs://file/1 s3a://bucket/destination
Options and documentation for these live here: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
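If very large files still cause trouble, the hadoop-aws documentation linked above also covers properties for tuning how s3a splits multipart uploads; a hedged sketch (the 100 MB part size below is purely illustrative):
hadoop distcp -Dfs.s3a.multipart.size=104857600 hdfs://file/1 s3a://bucket/destination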
Upvotes: 3
Reputation: 131
Amazon has created a version of distcp that is optimized for transferring between hdfs and s3 which they call, appropriately, s3distcp. You may want to check that out as well. It is intended for use with Amazon EMR, but the jar is available in s3, so you might be able to use it outside of an EMR job flow.
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
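For reference, an invocation on an EMR node typically looks something like the following (the jar location varies by AMI/release, so treat the path as illustrative; --src and --dest are the s3distcp options described in the guide above, and the paths reuse the ones from the question):
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src hdfs:///data/paramesh/ --dest s3://paramesh-test/data/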
Upvotes: 3
Reputation: 8259
You should use s3n instead of s3.
s3n is the native filesystem implementation (i.e. regular files), whereas s3 imposes an HDFS block structure on the files, so you can't really read them without going through the HDFS libraries.
Thus:
hadoop distcp hdfs://file/1 s3n://bucket/destination
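If you'd rather not embed the AWS credentials in the URI the way the question does, they can instead be passed as configuration properties; a sketch using the standard s3n credential keys and the paths from the question:
hadoop distcp -Dfs.s3n.awsAccessKeyId=<id> -Dfs.s3n.awsSecretAccessKey=<key> hdfs://hadoopmaster/data/paramesh/ s3n://paramesh-test/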
Upvotes: 15