Amir Bar

Reputation: 3105

Amazon S3 Key Name misunderstanding

From their docs they gave this example of a good implementation:

    examplebucket/animations/232a-2013-26-05-15-00-00/cust1234234/animation1.obj
    examplebucket/animations/7b54-2013-26-05-15-00-00/cust3857422/animation2.obj
    examplebucket/animations/921c-2013-26-05-15-00-00/cust1248473/animation3.obj
    examplebucket/videos/ba65-2013-26-05-15-00-00/cust8474937/video2.mpg
    examplebucket/videos/8761-2013-26-05-15-00-00/cust1248473/video3.mpg
    examplebucket/videos/2e4f-2013-26-05-15-00-01/cust1248473/video4.mpg
    examplebucket/videos/9810-2013-26-05-15-00-01/cust1248473/video5.mpg
    examplebucket/videos/7e34-2013-26-05-15-00-01/cust1248473/video6.mpg
    examplebucket/videos/c34a-2013-26-05-15-00-01/cust1248473/video7.mpg

I just don't understand how this is a good example of file naming for high performance.

If Amazon chooses the first 4 characters of the key for partitioning, then we have only 2 distinct prefixes here, which is bad:

  1. anim
  2. vide

So what am I missing?

Upvotes: 4

Views: 846

Answers (2)

Michael - sqlbot

Reputation: 179084

I believe the explanation is here, from the same page:

This example illustrates how Amazon S3 can use the first character of the key name for partitioning, but for very large workloads (more than 2,000 requests per second, or for buckets that contain billions of objects), Amazon S3 can use more characters for the partitioning scheme. Amazon S3 can automatically split these partitions further as the key count and request rate increase over time.

The implication (which is all we can really go on, since the internals of S3 aren't public information) is that, whenever necessary, S3 will automatically split index partitions in response to the workload, in order to reduce hot spots... but if you don't provide an obvious logical "split point" -- such as the introduction of some pseudo-randomness at a given point in the keyspace -- the algorithm will have nothing on which to base such a split.

Any time key values increase more or less monotonically -- that is, when objects are created in or near key order -- there is nothing the algorithm can do to carve one partition into two such that each of them will see an approximately equal write workload.

Randomness at a fixed point gives the algorithm a much clearer target for a split, and apparently that point can be wherever it needs to be in the key, not just at the beginning.
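For illustration, a minimal sketch of that idea (the bucket name, timestamp format, and paths below are my own hypothetical choices, not from the AWS docs), placing a random hex component at a fixed position within the key:

    # Hypothetical sketch: a 4-hex-char random component at a fixed
    # offset in the key gives S3 an obvious place to split partitions.
    rand=$(openssl rand -hex 2)          # e.g. "7b54"
    ts=$(date +%Y-%m-%d-%H-%M-%S)        # creation timestamp
    key="animations/${rand}-${ts}/cust1234234/animation1.obj"
    aws s3 cp animation1.obj "s3://examplebucket/${key}"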

Conversely, with this example, you provide an easy and obvious split point between animations and videos... the first partition point might be right there on the first character, and that might be enough... but if not, then there are obvious split points again at the end of either animations/ or videos/ ... or both. Then either of those partitions could be subsequently split again if necessary to accommodate the amount of traffic you offer.

I would further suggest that this is largely an academic discussion unless you're planning workloads of hundreds of requests per second, sustained. Store your objects with keys created with a useful and meaningful convention, giving appropriate -- but not excessive -- consideration to these guidelines.

Upvotes: 4

lcerezo

Reputation: 31

The keys are actually the entire "path name". There are no directories in S3 keys.

So, since creation/loading is linear, the advice is to create keys with a random component at the start of the key. In your example,

the key name should be something like c34a-videos/something/something.
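As a rough sketch (the bucket and object names here are hypothetical, not from the question), that could look like:

    # Hypothetical sketch: prepend a random 4-hex-char value so keys
    # created back-to-back land in different parts of the keyspace.
    prefix=$(openssl rand -hex 2)        # e.g. "c34a"
    aws s3 cp video2.mpg "s3://examplebucket/${prefix}-videos/cust8474937/video2.mpg"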

Your second option is to reverse ("rev") the date string at create time, e.g.:

[lcerezo@awstools ~]$ date +%s
1450308881
[lcerezo@awstools ~]$ date +%s|rev
6888030541
[lcerezo@awstools ~]$
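Putting that reversed timestamp to use might look like this (the bucket and paths are hypothetical):

    # Hypothetical sketch: the reversed epoch timestamp puts the
    # fastest-changing digit first, spreading sequential writes.
    ts=$(date +%s | rev)                 # e.g. "6888030541"
    aws s3 cp video3.mpg "s3://examplebucket/${ts}-videos/cust1248473/video3.mpg"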

Remember that the / is just a character. It appears as a directory to us meatbags, but it isn't a directory in the sense that the objects are stored in a POSIX-style file hierarchy.

Look at the examples offered here: https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

Hope this makes sense.

Upvotes: 0
