Paramesh

Reputation: 371

Problems with Hadoop distcp from HDFS to Amazon S3

I am trying to move data from HDFS to S3 using distcp. The distcp job seems to succeed, but on S3 the files are not being created correctly. There are two issues:

  1. The file names and paths are not replicated. All files end up as block_<some number> at the root of the bucket.
  2. It creates a bunch of extra files on S3 containing metadata and logs.

I could not find any documentation/examples for this. What am I missing? How can I debug?

Here are some more details:

$ hadoop version 
Hadoop 0.20.2-cdh3u0
Subversion  -r 
Compiled by diego on Sun May  1 15:42:11 PDT 2011
From source with checksum 
$ hadoop fs -ls hdfs://hadoopmaster/data/paramesh/
…<bunch of files>…

$ hadoop distcp hdfs://hadoopmaster/data/paramesh/ s3://<id>:<key>@paramesh-test/
$ ./s3cmd-1.1.0-beta3/s3cmd ls s3://paramesh-test

                       DIR   s3://paramesh-test//
                       DIR   s3://paramesh-test/test/
2012-05-10 02:20         0   s3://paramesh-test/block_-1067032400066050484
2012-05-10 02:20      8953   s3://paramesh-test/block_-183772151151054731
2012-05-10 02:20     11209   s3://paramesh-test/block_-2049242382445148749
2012-05-10 01:40      1916   s3://paramesh-test/block_-5404926129840434651
2012-05-10 01:40      8953   s3://paramesh-test/block_-6515202635859543492
2012-05-10 02:20     48051   s3://paramesh-test/block_1132982570595970987
2012-05-10 01:40     48052   s3://paramesh-test/block_3632190765594848890
2012-05-10 02:20      1160   s3://paramesh-test/block_363439138801598558
2012-05-10 01:40      1160   s3://paramesh-test/block_3786390805575657892
2012-05-10 01:40     11876   s3://paramesh-test/block_4393980661686993969

Upvotes: 9

Views: 20969

Answers (4)

stevel

Reputation: 13430

Updating this for Apache Hadoop 2.7+, and ignoring Amazon EMR as they've changed things there.

  1. If you are using Hadoop 2.7 or later, use s3a over s3n. This also applies to recent versions of HDP and, AFAIK, CDH.
  2. s3a supports 5+ GB files, has other nice features, and is tangibly better when reading files; it will only get better over time.
  3. Apache s3:// should be considered deprecated; you don't need it any more and shouldn't be using it.
  4. Amazon EMR uses "s3://" to refer to its own custom binding to S3. That's what you should be using if you are running on EMR.

Improving distcp reliability and performance when working with object stores is still an ongoing piece of work... contributions are, as always, welcome.
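
As a minimal sketch of the s3a route (the bucket, paths, and credential values below are placeholders, and it assumes the hadoop-aws s3a connector is on the classpath):

# Hypothetical example: copy from HDFS to an s3a:// destination, passing
# credentials as Hadoop properties instead of embedding them in the bucket URL.
hadoop distcp \
  -Dfs.s3a.access.key=YOUR_ACCESS_KEY \
  -Dfs.s3a.secret.key=YOUR_SECRET_KEY \
  hdfs://hadoopmaster/data/paramesh/ \
  s3a://paramesh-test/data/paramesh/

Unlike the s3:// block store, s3a writes each file as a regular object, so file names and directory structure are preserved in the bucket.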

Upvotes: 2

Sean

Reputation: 2475

If your files in HDFS are larger than 5 GB, you will encounter errors in your distcp job that look like:

Caused by: org.jets3t.service.S3ServiceException: S3 Error Message. -- ResponseCode: 400, ResponseStatus: Bad Request, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>23472570134</ProposedSize><MaxSizeAllowed>5368709120</MaxSizeAllowed><RequestId>5BDA6B12B9E99CE9</RequestId><HostId>vmWvS3Ynp35bpIi7IjB7mv1waJSBu5gfrqF9U2JzUYsXg0L7/sy42liEO4m0+lh8V6CqU7PU2uo=</HostId></Error>
    at org.jets3t.service.S3Service.putObject(S3Service.java:2267)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:122)
    ... 27 more
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

To fix this, either use the s3n filesystem as @matthew-rathbone suggested, adding -Dfs.s3n.multipart.uploads.enabled=true:

hadoop distcp -Dfs.s3n.multipart.uploads.enabled=true hdfs://file/1 s3n://bucket/destination

OR

Use the "next generation" S3 filesystem, s3a, like so:

hadoop distcp -Dfs.s3a.endpoint=s3.us-east-1.amazonaws.com hdfs://file/1 s3a://bucket/destination

Options and documentation for these live here: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
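
If you want to check up front whether any of your source files actually exceed the 5 GB single-upload limit, here is a rough sketch (the source path is the one from the question; adjust as needed):

# Rough check (path is illustrative): print per-file sizes under the source
# directory, largest first, to see whether any exceed 5 GB (5368709120 bytes).
hadoop fs -du hdfs://hadoopmaster/data/paramesh/ | sort -nr | head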

Upvotes: 3

dcsesq

Reputation: 131

Amazon has created a version of distcp that is optimized for transferring between HDFS and S3, which they call, appropriately, s3distcp. You may want to check that out as well. It is intended for use with Amazon EMR, but the JAR is available in S3, so you might be able to use it outside of an EMR job flow.

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
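
A hypothetical invocation, assuming you have the s3distcp JAR available locally (the JAR path, source, and bucket below are placeholders; see the linked guide for the actual location and full option list):

# Hypothetical s3distcp run outside EMR; --src and --dest name the
# source directory and the destination bucket/prefix.
hadoop jar /path/to/s3distcp.jar \
  --src hdfs://hadoopmaster/data/paramesh/ \
  --dest s3://paramesh-test/data/paramesh/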

Upvotes: 3

Matthew Rathbone

Reputation: 8259

You should use s3n instead of s3.

s3n is the native filesystem implementation (i.e., regular files); using s3 imposes an HDFS block structure on the files, so you can't really read them without going through HDFS libraries.

Thus:

hadoop distcp hdfs://file/1 s3n://bucket/destination

Upvotes: 15
