Derrops

Reputation: 8117

When would you want to make S3 object keys similar?

S3 uses the object key when partitioning data, and the guidance is to introduce some randomness into your keys to distribute workloads across multiple partitions. My question is: are there any scenarios in which you would want to have similar keys? And if not, why would AWS use the key to partition your data instead of randomly partitioning the data itself?

I ask this because it strikes me as an odd design: it makes it easy for developers to make partitioning mistakes when they generate keys that follow a pattern, and it discourages developers from creating keys in a logical manner, since doing so would inevitably produce a pattern and cause the data to be partitioned poorly.
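For concreteness, the kind of "logical" key scheme the question describes might look like the following minimal sketch (the key layout and names are hypothetical, purely for illustration):

```python
from datetime import datetime, timezone

def logical_key(user_id: str, log_name: str) -> str:
    """A 'logical', human-browsable key. Because the date comes first,
    every key written on the same day shares one prefix: exactly the
    kind of pattern the question is concerned about."""
    day = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    return f"{day}/{user_id}/{log_name}"

print(logical_key("user-42", "app.log"))  # e.g. 2018/05/14/user-42/app.log
```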

Upvotes: 0

Views: 200

Answers (2)

Michael - sqlbot

Reputation: 179054

S3 uses the object key when partitioning data

Wait. Your question seems premised on this assumption, but it isn't correct.

S3 does not use the object key to partition the data. That would indeed, as you suggest, be a very "odd design" (or worse).

S3 uses the object key to partition the index of objects in the bucket. Otherwise, the index would be stored in an order that would not support enumerating the object keys in sorted order, which would also eliminate the ability to list objects by prefix or to identify common prefixes using delimiters. The alternative would be a secondary index, which would just compound the potential scaling issue and move the same problem down one level.

The case for similar keys is when you want to find objects with a common prefix (in the same "folder") on demand. Storing log files is an easy example: yyyy/mm/dd/.... Note that when various services store log files in buckets for you (S3 logs, CloudFront, ELB), the object keys are sequential like this, because the date and time are part of the object key.
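A minimal sketch of listing by common prefix with boto3 (the bucket name and date-based key layout are assumptions for illustration):

```python
import boto3

s3 = boto3.client("s3")

# List all log objects for a given day by prefix; the yyyy/mm/dd/...
# key layout keeps related objects under one common prefix.
response = s3.list_objects_v2(
    Bucket="example-log-bucket",  # hypothetical bucket name
    Prefix="2018/05/14/",
)
for obj in response.get("Contents", []):
    print(obj["Key"])

# Delimiter="/" groups keys by the next path segment, returning the
# common prefixes ("subfolders") instead of every individual object.
response = s3.list_objects_v2(
    Bucket="example-log-bucket",
    Prefix="2018/05/",
    Delimiter="/",
)
for cp in response.get("CommonPrefixes", []):
    print(cp["Prefix"])
```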

When S3 does a partition split, only the index is split. The data is already durably stored and doesn't move. The potential performance considerations are related to the performance of the index, not that of the actual storage of the object data.

Upvotes: 1

John Rotenstein

Reputation: 269282

You appear to be referring to Request Rate and Performance Considerations - Amazon Simple Storage Service, which states:

The Amazon S3 best practice guidelines in this topic apply only if you are routinely processing 100 or more requests per second. If your typical workload involves only occasional bursts of 100 requests per second and fewer than 800 requests per second, you don't need to follow these guidelines.

This is unlikely to affect most applications, but if applications do have such high traffic, then spreading requests across the keyname space can improve performance.
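One common way to spread requests across the keyname space, in the spirit of those guidelines, is to prepend a short hash to each key. A minimal sketch (the key format is an assumption, not the only approach):

```python
import hashlib

def spread_key(object_id: str) -> str:
    """Prefix the key with the first four hex characters of its SHA-1
    hash so heavy request traffic fans out across the keyname space
    rather than concentrating on one narrow index partition."""
    prefix = hashlib.sha1(object_id.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}-{object_id}"

# Sequential IDs land under well-distributed prefixes:
for i in range(3):
    print(spread_key(f"image-{i:06d}.jpg"))
```

The trade-off, as the question notes, is that hashed keys lose the convenient prefix-based browsing that date-ordered keys provide.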

AWS has not explained why they have designed Amazon S3 in this manner.

Upvotes: 1
