Reputation: 518
I have a system that processes large data sets and downloads data from an S3 bucket.
Each instance downloads multiple objects from under a common prefix (dir) on S3. With a small number of instances, download speeds are good, i.e. 4-8 MiB/s, but with around 100-300 instances the download speed drops to roughly 80 KiB/s.
What might be the reasons behind this, and what can I do to remedy it?
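To make the setup concrete, the per-instance download pattern described above can be sketched like this. The `fetch` callable is a stand-in for the real per-object transfer (e.g. a boto3 `download_file` call), not code from the question:

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(keys, fetch, max_workers=8):
    """Run `fetch` (one S3 object download) for each key concurrently.

    `fetch` is a placeholder for the actual transfer, e.g.
    lambda key: s3.download_file(bucket, key, local_path_for(key)).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order in its results
        return list(pool.map(fetch, keys))
```

Each instance would call this on the keys under its prefix; with hundreds of instances, it is the aggregate request rate against the bucket (and the shared network path) that matters.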
Upvotes: 4
Views: 14247
Reputation: 36073
If your EC2 instances are in private subnets, then your NAT may be a limiting factor.
Try creating a VPC Endpoint for S3 and routing S3 traffic through it, so that traffic bypasses the NAT entirely (this also removes NAT data-processing charges for that traffic).
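One concrete remedy: a gateway VPC endpoint for S3 routes S3 traffic past the NAT. A minimal sketch with the AWS CLI, where the VPC ID, route-table ID, and region are placeholders for your own setup:

```shell
# Create a gateway endpoint for S3 in the instances' VPC.
# vpc-0abc1234 and rtb-0abc1234 are placeholder IDs; the route table
# should be the one used by the private subnets.
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0abc1234 \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0abc1234
```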
Upvotes: 8
Reputation: 13501
You probably want to use S3DistCp instead of managing concurrency and connections by hand...
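For example, a typical S3DistCp invocation on an EMR cluster looks like the following; the bucket and paths are illustrative, not from the question:

```shell
# One managed, parallel copy job instead of hand-rolled concurrent downloads
s3-dist-cp \
    --src  s3://mybucket/2017/july/ \
    --dest hdfs:///data/july/
```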
Upvotes: 0
Reputation: 2077
How are the objects in your S3 bucket named? Object naming can have a surprisingly large effect on bucket throughput due to partitioning. Behind the scenes, S3 partitions your bucket based on the object keys, and only the first 3-4 characters of the key really matter. Also note that the key is the entire path within the bucket, but the subpaths don't matter for partitioning. So if you have a bucket called `mybucket` with objects like `2017/july/22.log`, `2017/july/23.log`, `2017/june/1.log`, `2017/oct/23.log`, then the fact that you've partitioned by month doesn't actually matter, because only the first few characters of the entire key are used.
If you have a sequential naming structure for the objects in your bucket, then you will likely have bad performance with many parallel requests for objects. In order to get around this, you should assign a random prefix of 3-4 characters to each object in the bucket.
See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html for more information.
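One way to follow that advice is to derive the prefix deterministically by hashing the key, so readers can recompute it from the original key. A sketch; the helper name is mine, not from any AWS library:

```python
import hashlib

def prefixed_key(key, length=4):
    """Prepend a short hash-derived prefix so keys spread across S3 partitions.

    Hashing the key (rather than drawing a random value) keeps the prefix
    reproducible: anyone holding the original key can compute the full one.
    """
    prefix = hashlib.md5(key.encode("utf-8")).hexdigest()[:length]
    return f"{prefix}/{key}"
```

So `2017/july/22.log` becomes something like `ab12/2017/july/22.log`, and keys that previously shared the `2017` prefix now scatter across partitions.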
Upvotes: 4