Reputation: 518
I have a system that processes large data sets and downloads data from an S3 bucket.
Each instance downloads multiple objects from under a common prefix (dir) on S3. With a small number of instances, download speeds are good, i.e. 4-8 MiB/s, but with around 100-300 instances the download speed drops to roughly 80 KiB/s.
What might be the reasons behind this, and what can I do to remedy it?
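To make the setup concrete, the per-instance download pattern described above can be sketched like this. The `fetch` callable is a stand-in for the real per-object transfer (e.g. a boto3 `download_file` call), not code from the question:

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(keys, fetch, max_workers=8):
    """Run `fetch` (one S3 object download) for each key concurrently.

    `fetch` is a placeholder for the actual transfer, e.g.
    lambda key: s3.download_file(bucket, key, local_path_for(key)).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order in its results
        return list(pool.map(fetch, keys))
```

Each instance would call this on the keys under its prefix; with hundreds of instances, it is the aggregate request rate against the bucket (and the shared network path) that matters.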
Upvotes: 4
Views: 14247
Reputation: 36073
If your EC2 instances are in private subnets, then your NAT may be a limiting factor.
Try creating a VPC Endpoint for S3 and routing S3 traffic through it, so that traffic bypasses the NAT entirely (this also removes NAT data-processing charges for that traffic).
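One concrete remedy: a gateway VPC endpoint for S3 routes S3 traffic past the NAT. A minimal sketch with the AWS CLI, where the VPC ID, route-table ID, and region are placeholders for your own setup:

```shell
# Create a gateway endpoint for S3 in the instances' VPC.
# vpc-0abc1234 and rtb-0abc1234 are placeholder IDs; the route table
# should be the one used by the private subnets.
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0abc1234 \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0abc1234
```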
Upvotes: 8
Reputation: 13501
You probably want to use S3DistCp instead of managing concurrency and connections by hand...
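For example, a typical S3DistCp invocation on an EMR cluster looks like the following; the bucket and paths are illustrative, not from the question:

```shell
# One managed, parallel copy job instead of hand-rolled concurrent downloads
s3-dist-cp \
    --src  s3://mybucket/2017/july/ \
    --dest hdfs:///data/july/
```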
Upvotes: 0
Reputation: 2077
How are the objects in your S3 bucket named? Object naming can have a surprisingly large effect on bucket throughput due to partitioning. Behind the scenes, S3 partitions your bucket based on the object keys, and only the first 3-4 characters of the key really matter. Also note that the key is the entire path within the bucket, but the subpaths don't matter for partitioning. So if you have a bucket called `mybucket` with objects like `2017/july/22.log`, `2017/july/23.log`, `2017/june/1.log`, `2017/oct/23.log`, then the fact that you've partitioned by month doesn't actually matter, because only the first few characters of the entire key are used.
If you have a sequential naming structure for the objects in your bucket, then you will likely have bad performance with many parallel requests for objects. In order to get around this, you should assign a random prefix of 3-4 characters to each object in the bucket.
See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html for more information.
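One way to follow that advice is to derive the prefix deterministically by hashing the key, so readers can recompute it from the original key. A sketch; the helper name is mine, not from any AWS library:

```python
import hashlib

def prefixed_key(key, length=4):
    """Prepend a short hash-derived prefix so keys spread across S3 partitions.

    Hashing the key (rather than drawing a random value) keeps the prefix
    reproducible: anyone holding the original key can compute the full one.
    """
    prefix = hashlib.md5(key.encode("utf-8")).hexdigest()[:length]
    return f"{prefix}/{key}"
```

So `2017/july/22.log` becomes something like `ab12/2017/july/22.log`, and keys that previously shared the `2017` prefix now scatter across partitions.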
Upvotes: 4