Daniel Kats

Reputation: 5554

Does pseudorandom substring need to be at beginning of key to benefit from S3 partitioning

Based on this resource, adding a pseudo-random prefix to an S3 key will increase your GET performance over using a constant prefix.

So a key of the form:

bucket/$randomPrefix-key.txt

Will perform better in GETs than

bucket/$date-key.txt

It also implies that the common prefix portion doesn't matter. From the article:

You can optionally add more prefixes in your key name, before the hash string, to group objects. The following example adds animations/ and videos/ prefixes to the key names.

examplebucket/animations/232a-2013-26-05-15-00-00/cust1234234/animation1.obj
examplebucket/animations/7b54-2013-26-05-15-00-00/cust3857422/animation2.obj
examplebucket/animations/921c-2013-26-05-15-00-00/cust1248473/animation3.obj
examplebucket/videos/ba65-2013-26-05-15-00-00/cust8474937/video2.mpg
examplebucket/videos/8761-2013-26-05-15-00-00/cust1248473/video3.mpg
examplebucket/videos/2e4f-2013-26-05-15-00-01/cust1248473/video4.mpg
examplebucket/videos/9810-2013-26-05-15-00-01/cust1248473/video5.mpg
examplebucket/videos/7e34-2013-26-05-15-00-01/cust1248473/video6.mpg
examplebucket/videos/c34a-2013-26-05-15-00-01/cust1248473/video7.mpg
...

So a key of the form

bucket/foo/bar/baz/$randomPrefix-key.txt

Will apparently work just as well as the first form, bucket/$randomPrefix-key.txt.

My question: what if the pseudorandom prefix is in the middle of the key? Does that work just as well?

For example:

bucket/foo/bar/baz-$pseudoRandomString-key.txt
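For concreteness, the three key shapes under discussion could be generated like this (a sketch only; the MD5-derived 4-character hex string mirrors the prefixes in the AWS example, and the helper name and fixed path segments are illustrative, not from the question):

```python
import hashlib


def make_keys(object_name: str) -> tuple[str, str, str]:
    # Derive a short pseudo-random hex string from the object name,
    # similar to the 4-character hash prefixes in the S3 documentation.
    rand = hashlib.md5(object_name.encode()).hexdigest()[:4]

    # Random string at the very start of the key:
    leading = f"{rand}-{object_name}"
    # Random string after a fixed common prefix:
    grouped = f"foo/bar/baz/{rand}-{object_name}"
    # Random string in the middle of the final segment, as asked here:
    middle = f"foo/bar/baz-{rand}-{object_name}"
    return leading, grouped, middle
```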

Upvotes: 1

Views: 87

Answers (1)

Michael - sqlbot

Reputation: 179124

Your example is no different from the ones in the documentation, for an important reason: slashes (/) have no intrinsic meaning to S3.

There are no folders in S3. foo/bar.txt and foo/baz.jpg are not "in the same folder."

Technically, they are just two objects whose keys have a common prefix.

The console displays them in a folder only for organizational convenience.

Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects.

http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html

Also:

The Amazon S3 data model does not natively support the concept of folders, nor does it provide any APIs for folder-level operations. But the Amazon S3 console supports folders to help you organize your data.

http://docs.aws.amazon.com/AmazonS3/latest/UG/about-using-console.html

Thus the / has no special meaning to the S3 index, and no special meaning relative to the placement of your random prefix.

However, it's important that the characters before the random prefix remain the same, so that partition splits can be accomplished right at the beginning of the random characters.

S3 must be able to split the list of keys beginning with the first random character and find a balance of work to the left of (<) and right of (>=) the split point.

If you have this...

fix/ed/chars/here-then-$random/anything/here

...then S3 says to itself "hmm... it looks like example-bucket/fixed/chars/here-then-* seems to be taking a lot of traffic, but it looks like the next character is always one of 0 1 2 3 4 5 6 7 8 9 a b c d e f and they're pretty well balanced, so I'm going to split it at "8," so that ...then-0* through ...then-7* is in one partition and ...then-8 through ...then-f in another" and #boom, potential performance bottleneck solved.

The partitioning is completely automatic and transparent.
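As a toy illustration of why a single hex character makes a workable split point (purely a simulation; S3's internal mechanics aren't exposed):

```python
import hashlib

# Simulate 1000 object keys whose names hash to a 4-character hex prefix,
# as in the fix/ed/chars/here-then-$random example above.
keys = [hashlib.md5(f"object-{i}".encode()).hexdigest()[:4]
        for i in range(1000)]

# Split the keyspace at "8": everything < "8" goes to one partition,
# everything >= "8" to the other.
left = sum(1 for k in keys if k < "8")
right = len(keys) - left

# Because hex digits are roughly uniform, the two halves carry
# roughly equal shares of the workload.
```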

Here's an example of what not to do.

logs/2017-01-23/$random/...
logs/2017-01-24/$random/...
logs/2017-01-25/$random/...

Here, a hot spot develops in a different prefix each day, giving S3 no good options for creating effective partition splits to alleviate the overload: any split point it chooses will, at some point, end up lexically to the left of (less than) all future uploads, so the split stops doing any work. By contrast, the split described above puts about half the workload < and the other half >= a single-character split point.
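The anti-pattern can be demonstrated with a toy simulation (the key layout matches the logs/ example above; the helper name is illustrative):

```python
import hashlib

# Keys where the date comes BEFORE the random part (the anti-pattern).
def log_key(date: str, name: str) -> str:
    rand = hashlib.md5(name.encode()).hexdigest()[:4]
    return f"logs/{date}/{rand}/{name}"

jan23 = [log_key("2017-01-23", f"req-{i}") for i in range(100)]
jan24 = [log_key("2017-01-24", f"req-{i}") for i in range(100)]

# Any split point chosen while Jan 23 was hot sorts entirely below
# every Jan 24 key, so the next day's traffic all lands on one side
# of the split -- the split accomplishes nothing.
split = max(jan23)
assert all(k > split for k in jan24)
```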


Also worth noting: if you don't expect a sustained workload of at least 100 requests/second, this isn't going to give you any benefit at all. Natural randomness in your keyspace may also suffice. And S3 reads can scale essentially indefinitely without these optimizations when coupled with CloudFront -- usually faster, and often slightly cheaper, since CloudFront bandwidth pricing is slightly lower than S3's in some areas (presumably because it relieves potential congestion on the Internet connections at the S3 regions). When S3 is connected to CloudFront, S3 rates its bandwidth charges at $0.00/GB out to the Internet, and CloudFront bills that piece at its own rates instead.

Upvotes: 2
