Yahia

Reputation: 1339

How to upload large files from HDFS to S3

I have an issue uploading a large file (larger than 5 GB) from HDFS to S3. Is there a way to upload the file directly from HDFS to S3, using multipart upload, without first downloading it to the local file system?

Upvotes: 1

Views: 3154

Answers (2)

stevel

Reputation: 13430

If you are using Hadoop 2.7.1 or later, use the s3a:// filesystem to talk to S3. It supports multi-part uploads, which is what you need here.
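A minimal sketch of what that looks like through the Hadoop FileSystem API is below. The paths, bucket name, and credential values are placeholders; in practice credentials are usually set in core-site.xml or supplied by instance roles rather than hard-coded.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class HdfsToS3Copy {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder credentials; normally configured in core-site.xml or via IAM roles.
            conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
            conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

            Path src = new Path("hdfs:///data/large-file.bin");     // hypothetical HDFS source
            Path dst = new Path("s3a://my-bucket/large-file.bin");  // hypothetical S3 destination

            FileSystem srcFs = src.getFileSystem(conf);
            FileSystem dstFs = dst.getFileSystem(conf);

            // Streams the file from HDFS straight into the s3a output stream; the
            // connector splits the write into multipart uploads once it passes
            // fs.s3a.multipart.size, so nothing is staged on the local disk.
            FileUtil.copy(srcFs, src, dstFs, dst, /* deleteSource */ false, conf);
        }
    }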

Update: September 2016

I should add that we are reworking the S3A output stream for Hadoop 2.8; the current one buffers multipart uploads in the heap, and falls over when you are generating bulk data faster than your network can push to S3.

Upvotes: 3

Manjunath Ballur

Reputation: 6343

For copying data between HDFS and S3, you should use s3DistCp. s3DistCp is optimized for AWS and copies large numbers of files efficiently, in parallel, to and across S3 buckets.

For s3DistCp usage, you can refer to the documentation here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

The code for s3DistCp is available here: https://github.com/libin/s3distcp
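s3-dist-cp runs as a step on an EMR cluster. As a rough sketch, assuming the AWS SDK for Java (v1), a hypothetical cluster id, and placeholder source/destination URIs, you can submit it through command-runner.jar like this:

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class S3DistCpStep {
        public static void main(String[] args) {
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

            // command-runner.jar lets EMR run s3-dist-cp as a cluster step.
            HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                    .withJar("command-runner.jar")
                    .withArgs("s3-dist-cp",
                              "--src", "hdfs:///data/",          // hypothetical HDFS source dir
                              "--dest", "s3://my-bucket/data/"); // hypothetical S3 destination

            StepConfig step = new StepConfig()
                    .withName("Copy HDFS data to S3 with s3-dist-cp")
                    .withActionOnFailure("CONTINUE")
                    .withHadoopJarStep(jarStep);

            emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                    .withJobFlowId("j-XXXXXXXXXXXXX")            // hypothetical EMR cluster id
                    .withSteps(step));
        }
    }

The --src and --dest arguments take HDFS and S3 URIs; s3-dist-cp then runs a MapReduce job on the cluster to perform the copy in parallel.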

Upvotes: 3
