Intrastellar Explorer

Reputation: 2471

Amazon AWS S3 Glacier: is there a file hierarchy

Does Amazon AWS S3 Glacier support some semblance of file hierarchy inside a Vault for Archives?

For example, in AWS S3, objects are given hierarchy via the / character, e.g. all_logs/some_sub_category/log.txt

I am storing multiple .tar.gz files, and would like to organize them in a similar hierarchy.

I could not find how to do this documented anywhere. If file hierarchy inside S3 Glacier is possible, can you provide brief instructions for how to do so?

Upvotes: 6

Views: 604

Answers (1)

Bruno Reis

Reputation: 37832

Does Amazon AWS S3 Glacier support some semblance of file hierarchy inside a Vault for Archives?

No, there's no hierarchy other than "archives exist inside a vault".

For example, in AWS S3, objects are given hierarchy via the / character, e.g. all_logs/some_sub_category/log.txt

This is actually incorrect.

S3 doesn't have any inherent hierarchy. The / character is no different from any other character that is valid in an S3 object key.

The S3 Console, and most S3 client tools (including the AWS CLI), treat the / character in a special way. But note that this is a client-side convention: the client issues listing requests with / as a delimiter so that keys behave the way most people expect, as if / were a hierarchy separator.
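To make the client-side nature of this concrete, here's a minimal boto3 sketch (the bucket name and prefix are hypothetical): the ListObjectsV2 call takes a Delimiter parameter, and it's that parameter, not anything about the keys themselves, that produces the folder-like grouping.

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to group keys on "/" while listing. The keys themselves are flat;
# the folder-like view comes entirely from the Delimiter parameter.
resp = s3.list_objects_v2(
    Bucket="my-log-bucket",     # hypothetical bucket
    Prefix="all_logs/",
    Delimiter="/",
)

# "Subdirectories" show up as CommonPrefixes; keys at this level as Contents.
for cp in resp.get("CommonPrefixes", []):
    print("prefix:", cp["Prefix"])
for obj in resp.get("Contents", []):
    print("object:", obj["Key"])
```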

If file hierarchy inside S3 Glacier is possible, can you provide brief instructions for how to do so?

You need to keep track of your hierarchy separately. For example, when you store an archive in Glacier, you could write metadata about that archive (e.g., its logical path and the archive ID Glacier returns) in a database (RDS, DynamoDB, etc.).
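A minimal boto3 sketch of that idea (the vault name, table name, and file name are hypothetical): upload the archive to Glacier, then record the logical "path" to archiveId mapping yourself in DynamoDB.

```python
import boto3

glacier = boto3.client("glacier")
table = boto3.resource("dynamodb").Table("glacier-archives")   # hypothetical table

# Upload the archive; Glacier returns an opaque archiveId, not a name or path.
with open("2024-01-logs.tar.gz", "rb") as f:                   # hypothetical file
    resp = glacier.upload_archive(
        vaultName="my-vault",                                  # hypothetical vault
        archiveDescription="all_logs/some_sub_category/2024-01-logs.tar.gz",
        body=f,
    )

# Record the logical "path" -> archiveId mapping yourself; Glacier won't.
table.put_item(Item={
    "logical_path": "all_logs/some_sub_category/2024-01-logs.tar.gz",
    "archive_id": resp["archiveId"],
    "vault": "my-vault",
})
```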


As a side note, be careful about .tar.gz in Glacier, especially if you're talking about (1) a very large archive (2) that is composed of a large number of small individual files (3) which you may want to access individually.

If those conditions are met (and in my experience they often are in real-world scenarios), then using .tar.gz will often lead to excessive costs when retrieving data.

The reason is that you pay per request as well as per amount of data retrieved. So while having one huge .tar.gz file may reduce your costs in terms of number of requests, the fact that gzip uses DEFLATE, a non-splittable compression format, means you'll have to retrieve the entire .tar.gz archive and decompress it just to get the one file you actually want.

An alternative approach that solves the problem described above, and that ties back to your question and my answer, is to first gzip the individual files and then tar them together. This solves the problem because when you tar the files together, each individual file occupies a clearly bounded region inside the tarball. When you request a retrieval from Glacier, you can then ask for only a range of the archive: e.g., "Glacier, give me bytes 105MB through 115MB of archive X". That way you (1) reduce the total number of requests (since you store a single tar file), and (2) reduce the amount of data retrieved (you fetch only the range you need) while still keeping storage small (the individual members are compressed).

Now, to know which range you need to retrieve, you'll need to store metadata somewhere — usually the same place where you will keep your hierarchy! (like I mentioned above, RDS, DynamoDB, Elasticsearch, etc).
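Here's a rough sketch of the whole idea in Python (the file names, vault name, and archive ID are hypothetical): gzip files individually, tar the .gz files, record each member's byte range as the metadata you'd keep alongside the archiveId, and later ask Glacier for just that (megabyte-aligned) range.

```python
import gzip
import shutil
import tarfile
import boto3

MB = 1024 * 1024

# 1. Gzip each file individually, then tar the .gz files together (plain tar,
#    since the members are already compressed).
for name in ["a.log", "b.log"]:                          # hypothetical inputs
    with open(name, "rb") as src, gzip.open(name + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)

with tarfile.open("logs.tar", "w") as tar:
    tar.add("a.log.gz")
    tar.add("b.log.gz")

# 2. Re-read the tar to record each member's byte range; this is the metadata
#    you'd store (DynamoDB, RDS, ...) next to the archiveId.
ranges = {}
with tarfile.open("logs.tar", "r") as tar:
    for m in tar.getmembers():
        ranges[m.name] = (m.offset_data, m.offset_data + m.size - 1)

# 3. Later, retrieve just one member. Glacier requires megabyte-aligned ranges,
#    so round the stored range outward (clamp the end to archive size - 1 if
#    the member sits at the very end of the archive).
start, end = ranges["b.log.gz"]
byte_range = f"{(start // MB) * MB}-{((end // MB) + 1) * MB - 1}"

glacier = boto3.client("glacier")
glacier.initiate_job(
    vaultName="my-vault",                                # hypothetical vault
    jobParameters={
        "Type": "archive-retrieval",
        "ArchiveId": "EXAMPLE-ARCHIVE-ID",               # from the earlier upload
        "RetrievalByteRange": byte_range,
    },
)
```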

Anyways, just an optimization that could save a tremendous amount of money in the future (and I've worked with a ton of customers who wasted a lot of money because they didn't know about this).

Upvotes: 6
