buildmaestro

Reputation: 1446

Add a random prefix to the key names to improve S3 performance?

You expect this bucket to immediately receive over 150 PUT requests per second. What should the company do to ensure optimal performance?

A) Amazon S3 will automatically manage performance at this scale.

B) Add a random prefix to the key names.

The correct answer was B and I'm trying to figure out why that is. Can someone please explain the significance of B and if it's still true?

Upvotes: 25

Views: 31023

Answers (7)

Julien Laurenceau

Reputation: 340

@tagar That's true, especially if you are not in a read-intensive scenario! You have to read the fine print of the docs to reverse engineer how it works internally and how the system limits you. There is no magic!

503 Slow Down errors are typically emitted when a single shard of S3 hits a hot spot: too many requests landing on one shard. What is difficult to understand is how sharding is done internally, and that the advertised request limit is not guaranteed.
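As an aside (my illustration, not part of the original answer): while S3 rebalances, the usual client-side mitigation is to let the SDK back off and retry on 503 Slow Down. A minimal boto3 sketch, assuming default credentials and placeholder bucket/key names:

    import boto3
    from botocore.config import Config

    # "adaptive" retry mode adds client-side rate limiting on top of exponential
    # backoff, which helps ride out 503 Slow Down responses from a hot shard.
    retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})

    s3 = boto3.client("s3", config=retry_config)

    # Placeholder bucket/key used for illustration only.
    s3.put_object(Bucket="my-bucket", Key="some/prefix/object.bin", Body=b"payload")

This does not raise the per-shard limit; it only smooths the client's request rate while the internal rebalancing catches up.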

pre-2018 behavior gives the details: it was advised to start the first 6-8 characters of the key prefix with random characters to avoid hot spots.
One can then assume that the initial sharding of an S3 bucket is done based on the first 8 characters of the prefix.
https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/

post-2018: automatic sharding was put in place and AWS no longer advises worrying about the first characters of the prefix... However, from these docs:

http-5xx-errors-s3

amazon-s3-performance-tips-fb76daae65cb

One can understand that this automatic shard rebalancing only works well if the load on a prefix is PROGRESSIVELY scaled up to the advertised limits:

If the request rate on the prefixes increases gradually, Amazon S3 scales up to handle requests for each of the two prefixes. (S3 will scale up to handle 3,500 PUT/POST/DELETE or 5,500 GET requests per second.) As a result, the overall request rate handled by the bucket doubles.

From my experience, 503s can appear well before the advertised levels, and there is no guarantee on how quickly S3 rebalances internally. If you are in a write-intensive scenario, for example uploading a lot of small objects, the automatic scaling won't rebalance your load efficiently.

In short: if you are relying on S3 performance, I advise sticking to the pre-2018 rules so that the initial sharding of your storage works immediately and does not depend on S3's auto-rebalancing algorithm:

  • hash the first 6 characters of the prefix, or design a data model that balances partitions uniformly across the first 6 characters of the prefix (a sketch follows below)
  • avoid small objects (target object size ~128 MB)
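A minimal sketch of the first point, in Python; the 6-character hash width and the example key are assumptions for illustration, not from the answer:

    import hashlib

    def hashed_key(original_key: str, hash_len: int = 6) -> str:
        """Prepend a short deterministic hash so keys spread uniformly across
        the leading characters S3 may use for its initial partitioning."""
        digest = hashlib.md5(original_key.encode("utf-8")).hexdigest()[:hash_len]
        return f"{digest}/{original_key}"

    # "2021/06/01/event.json" -> e.g. "5f2a9c/2021/06/01/event.json"
    print(hashed_key("2021/06/01/event.json"))

Because the hash is derived from the key itself, readers can recompute the prefix instead of having to store a mapping.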

Upvotes: 4

ABoringAI

Reputation: 109

As of June 2021.

As mentioned in the AWS guide Best practices design patterns: optimizing Amazon S3 performance, an application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket.

I think a random prefix will help scale S3 performance. For example, if we have 10 prefixes in one S3 bucket, it can handle up to 35,000 PUT/COPY/POST/DELETE requests and 55,000 GET/HEAD requests per second.
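To illustrate the arithmetic (my sketch, not part of the original answer): if keys are routed evenly into a fixed set of prefixes, the per-prefix limits add up. The prefix naming and count below are assumptions:

    import hashlib

    PREFIX_COUNT = 10  # 10 prefixes -> roughly 10 x 3,500 writes/s and 10 x 5,500 reads/s

    def shard_key(original_key: str) -> str:
        """Route a key into one of PREFIX_COUNT prefixes deterministically."""
        shard = int(hashlib.md5(original_key.encode("utf-8")).hexdigest(), 16) % PREFIX_COUNT
        return f"shard-{shard:02d}/{original_key}"

    print(shard_key("logs/2021/06/01/app.log"))  # e.g. "shard-07/logs/2021/06/01/app.log"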

https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

Upvotes: 1

Tagar

Reputation: 14891

S3 prefixes used to be determined by the first 6-8 characters;

This has changed mid-2018 - see announcement https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/

But that is only half the truth. Actually, prefixes (in the old definition) still matter.

S3 is not traditional “storage” - each directory/filename is a separate object in a key/value object store. The data also has to be partitioned/sharded to scale to a quadzillion objects. So yes, this new sharding is kind of “automatic”, but not really if you create a new process that writes to it with crazy parallelism across different subdirectories. Before S3 learns the new access pattern, you may run into throttling until it reshards/repartitions the data accordingly.

Learning new access patterns takes time. Repartitioning of the data takes time.

Things did improve in mid-2018 (~10x throughput-wise for a new bucket with no statistics), but it's still not what it could be if the data is partitioned properly. To be fair, this may not apply to you if you don't have a ton of data or your access pattern is not hugely parallel (e.g. running a Hadoop/Spark cluster on many TBs of data in S3 with hundreds of tasks accessing the same bucket in parallel).

TLDR:

"Old prefixes" still do matter. Write data to root of your bucket, and first-level directory there will determine "prefix" (make it random for example)

"New prefixes" do work, but not initially. It takes time to accommodate to load.

PS. Another approach: you can reach out to your AWS TAM (if you have one) and ask them to pre-partition a new S3 bucket if you expect a ton of data to flood it soon.

Upvotes: 6

Srujan Bangaru

Reputation: 341

B is correct because, without randomness (entropy), similar key names (for example, keys prefixed with the current year) place all the objects close to each other in the same partition of the index. When your application experiences an increase in traffic, it keeps hitting the same section of the index, resulting in decreased performance. So app devs add random prefixes to avoid this. Note: AWS may have since taken care of this automatically, so devs might not need to, but this is the reasoning behind the expected answer.

Upvotes: 1

How to introduce randomness to S3 ?

  1. Prefix folder names with random hex hashes. For example: s3://BUCKET/23a6-FOLDERNAME/FILENAME.zip

  2. Prefix file names with timestamps. For example: s3://BUCKET/FOLDERNAME/2013-26-05-15-00-00-FILENAME.zip (both patterns are sketched below)
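A small sketch of both patterns; BUCKET/FOLDERNAME/FILENAME are the answer's placeholders, and the hash source and timestamp format are my assumptions:

    import hashlib
    from datetime import datetime, timezone

    def hex_prefixed_folder(folder: str, filename: str) -> str:
        # Pattern 1: prefix the folder name with a short hex hash.
        h = hashlib.md5(filename.encode("utf-8")).hexdigest()[:4]
        return f"s3://BUCKET/{h}-{folder}/{filename}"

    def timestamp_prefixed_file(folder: str, filename: str) -> str:
        # Pattern 2: prefix the file name with a timestamp.
        ts = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S")
        return f"s3://BUCKET/{folder}/{ts}-{filename}"

    print(hex_prefixed_folder("FOLDERNAME", "FILENAME.zip"))
    print(timestamp_prefixed_file("FOLDERNAME", "FILENAME.zip"))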

Upvotes: 1

jpspesh

Reputation: 536

As of a 7/17/2018 AWS announcement, hashing and random prefixing the S3 key is no longer required to see improved performance: https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/

Upvotes: 51

buildmaestro

Reputation: 1446

Because of how lookups and writes work, using filenames that are similar or ordered can harm performance.

Adding hash/random-id prefixes to the S3 key is still advisable to alleviate high loads on heavily accessed objects.

Amazon S3 Performance Tips & Tricks

Request Rate and Performance Considerations

Upvotes: 1
