Reputation: 1339
I have an issue uploading a large file (larger than 5 GB) from HDFS to S3. Is there a way to upload the file directly from HDFS to S3 using multipart upload, without first downloading it to the local file system?
Upvotes: 1
Views: 3154
Reputation: 13430
If you are using Hadoop 2.7.1 or later, use the s3a:// filesystem to talk to S3. It supports multi-part uploads, which is what you need here.
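For example, a distcp run over s3a can do the copy without staging the file locally. This is a minimal sketch; the bucket name, paths, credentials, and part size below are placeholders, not values from the question.

# Copy a large file from HDFS straight to S3 via the s3a:// connector (Hadoop 2.7.1+).
# Access key, secret key, bucket, and paths are placeholders.
hadoop distcp \
  -Dfs.s3a.access.key=YOUR_ACCESS_KEY \
  -Dfs.s3a.secret.key=YOUR_SECRET_KEY \
  -Dfs.s3a.multipart.size=104857600 \
  hdfs:///data/large-file s3a://your-bucket/data/large-file

The -D properties can also be set in core-site.xml; fs.s3a.multipart.size (here 100 MB, in bytes) controls the size of each multipart part.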
Update: September 2016
I should add that we are reworking the S3A output stream for Hadoop 2.8; the current one buffers multipart uploads in the heap and falls over when you generate bulk data faster than your network can push it to S3.
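As a rough sketch of what that means in practice: from Hadoop 2.8 on, the S3A fast-upload path can buffer blocks on local disk instead of in the heap. The property values below are illustrative, not a prescription.

# Hadoop 2.8+: buffer multipart blocks on local disk rather than in the JVM heap,
# and limit how many blocks a single stream can have queued or in flight.
# Values and paths are illustrative placeholders.
hadoop distcp \
  -Dfs.s3a.fast.upload=true \
  -Dfs.s3a.fast.upload.buffer=disk \
  -Dfs.s3a.fast.upload.active.blocks=4 \
  hdfs:///data/large-file s3a://your-bucket/data/large-file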
Upvotes: 3
Reputation: 6343
For copying data between HDFS and S3, you should use s3DistCp. s3DistCp is optimized for AWS and efficiently copies large numbers of files in parallel, including across S3 buckets.
For usage of s3DistCp, refer to the documentation here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
The code for s3DistCp is available here: https://github.com/libin/s3distcp
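For example, on an EMR cluster it can be run from the master node or submitted as a step; the bucket and paths below are placeholders.

# Copy everything under an HDFS directory to S3 in parallel map tasks (EMR).
# Bucket and paths are placeholders.
s3-dist-cp --src hdfs:///data/ --dest s3://your-bucket/data/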
Upvotes: 3