Arno

Reputation: 81

Best way to partition AWS Athena tables for querying S3 data with high cardinality

We have an S3 bucket where we store thousands of records every day (adding up to many GB of data that keeps growing), and we want to be able to run Athena queries on them.

The data in S3 is stored in patterns like this: S3://bucket/Category/Subcategory/file. There are multiple categories (more than 100) and each category has 1-20 subcategories. All the files we store in S3 (in Apache Parquet format) contain sensor readings. There are categories with millions of sensor readings (sensors send thousands per day) and categories with just a few hundred readings (sensors send on average a few readings per month), so the data is not split evenly across categories. A reading includes a timestamp, a sensorid and a value, among other things.

We want to run Athena queries on this bucket's objects, based on date and sensorid, with the lowest cost possible, e.g.: "Give me all the readings in that category above that value", or "Give me the last readings of all sensorids in a category".
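In SQL terms, the queries we have in mind would look roughly like the sketches below (the table and column names are just illustrative):

    -- All readings in a category above some threshold value
    SELECT *
    FROM sensor_readings
    WHERE category = 'some-category'
      AND value > 100;

    -- Last reading of each sensorid in a category (latest timestamp per sensor)
    SELECT sensorid, max(ts) AS last_reading_ts
    FROM sensor_readings
    WHERE category = 'some-category'
    GROUP BY sensorid;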

What is the best way to partition our Athena table? And what is the best way to store our readings in S3 so that it is easier for Athena to run the queries? We have the freedom to save one reading per file, resulting in millions of files (which would make it easy to partition per sensorid or date, but what about performance if we have millions of files per day?), or multiple readings per file (far fewer files, but we cannot directly partition per sensorid or date, because not all readings in a file are from the same sensor and we need to save them in the order they arrive). Is Athena a good solution for our case, or is there a better alternative?

Any insight would be helpful. Thank you in advance.

Upvotes: 5

Views: 2278

Answers (1)

Bruno Reis

Reputation: 37832

Some comments.

Is Athena a good solution for our case or is there a better alternative?

Athena is great when you don't need or want to set up a more sophisticated big data pipeline: you simply put (or already have) your data in S3, and you can start querying it immediately. If that's enough for you, then Athena may be enough for you.

Here are a few things that are important to consider to properly answer that specific question:

  • How often are you querying? (i.e., is it worth having some sort of big data cluster running non-stop, like an EMR cluster, or is it better to just pay when you query, even if it means that your cost per query could end up higher?)

  • How much flexibility do you want when processing the dataset? (i.e., does Athena offer all the capabilities you need?)

  • What are all the data stores that you may want to query "together"? (i.e., is all the data in S3, and will it stay there? Or do you have, or will you have, data in other services such as DynamoDB, Redshift, EMR, etc.?)

Note that none of these answers would necessarily say "don't use Athena" — they may just suggest what kind of path you may want to follow going forward. In any case, since your data is in S3 already, in a format suitable for Athena, and you want to start querying it already, Athena is a very good choice right now.

Give me all the readings in that category above that value, or Give me the last readings of all sensorids in a category.

In both examples, you are filtering by category. This suggests that partitioning by category may be a good idea (whether you're using Athena or not!). You're doing that already, by having /Category/ as part of the objects' keys in S3.

One way to identify good candidates for a partitioning scheme is to think about all the queries (at least the most common ones) that you're going to run, and check which columns they filter on by equality or group by. E.g., thinking in terms of SQL, a column that frequently shows up in queries as WHERE XXX = ? is a good candidate for partitioning.

Maybe you have many other types of queries, but I couldn't help noticing that both your examples have filters on category, so it feels "natural" to partition by category (like you did).

Feel free to add a comment with other examples of common queries if that was just a coincidence and filtering by category is not as important/common as the examples suggest.
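As a rough sketch (the column names, types, and bucket path below are assumptions based on your description, with most of a reading's fields omitted), a category-partitioned table could be declared in Athena like this:

    CREATE EXTERNAL TABLE sensor_readings (
      sensorid string,
      ts       timestamp,
      value    double
    )
    PARTITIONED BY (category string)
    STORED AS PARQUET
    LOCATION 's3://bucket/';

Note that because your keys look like /Category/Subcategory/file rather than Hive-style category=.../, Athena won't discover the partitions automatically; you'd register each category's location yourself (see the ALTER TABLE example further down).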

What is the best way to partition our Athena table? And what is the best way to store our readings in S3 so that it is easier for Athena to run the queries?

There's hardly a single (i.e., best) answer here. It's always a trade-off based on lots of characteristics of the dataset (structure; size; number of records; growth; etc.) and of the access patterns (proportion of reads and writes; kinds of writes, e.g. append-only, updates, removals, etc.; presence of common filters among a large proportion of queries; which queries you're willing to sacrifice in order to optimize others; etc.).

Here's some general guidance (not only for Athena, but in general, in case you decide you may need something other than Athena).

There are two very important things to focus on to optimize a big data environment:

  1. Minimize I/O: it is slow.

  2. Spread work evenly across all "processing units" you have, ideally fully utilizing each of them.

Here's why they matter.

First, for a lot of "real world access patterns", I/O is the bottleneck: reading from storage is many orders of magnitude slower than filtering a record in the CPU. So try to focus on reducing the amount of I/O, which means both reducing the volume of data read and reducing the number of individual I/O operations.

Second, if you end up with an uneven distribution of work across multiple workers, it may happen that some workers finish quickly while other workers take much longer, and their work cannot be divided further. This is a very common issue. In this case, you'll have to wait for the slowest worker to complete before you can get your results. When you ensure that all workers are doing an equivalent amount of work, they'll all be working at near 100% utilization and they'll all finish at approximately the same time, so you don't have to keep waiting for the slower ones.

Things to keep in mind to help with those goals:

  • Avoid too big and too small files.

If you have a huge number of tiny files, then your analytics system will have to issue a huge number of I/O operations to retrieve the data. This hurts performance (and, in the case of S3, where you pay per request, can dramatically increase cost).

If you have a small number of huge files, depending on the characteristics of the file format and the worker units, you may end up not being able to parallelize work too much, which can cause performance to suffer.

Try to keep the file sizes uniform, so that you don't end up with a worker unit finishing too quickly and then idling (this may be an issue in some querying systems, but not in others).

Keeping files in the range of "a few GB per file" is usually a good choice.
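For example, if your ingestion process leaves you with lots of small Parquet files, Athena's CTAS (CREATE TABLE AS SELECT) can periodically rewrite them into fewer, larger ones. The table, bucket, and column names below are assumptions; bucketing is just one way to control how many output files get written:

    -- Rewrite one category's small files into a handful of larger Parquet files
    CREATE TABLE sensor_readings_category_a_compacted
    WITH (
      format            = 'PARQUET',
      external_location = 's3://bucket-compacted/category-a/',
      bucketed_by       = ARRAY['sensorid'],
      bucket_count      = 16
    ) AS
    SELECT sensorid, ts, value
    FROM sensor_readings
    WHERE category = 'category-a';

The external_location has to point to an empty prefix, and with bucketing the output is written as bucket_count files, which also helps keep the file sizes uniform.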

  • Use compression (and prefer splittable compression algos).

Compressing files greatly improves performance because it reduces I/O tremendously: most "real world" datasets have a lot of common patterns, and are thus highly compressible. When data is compressed, the analytics system spends less time reading from storage, and the "extra CPU time" spent to decompress the data before it can truly be queried is negligible compared to the time saved reading from storage.

Keep in mind that some compression algorithms are non-splittable: it means that one must start from the beginning of the compressed stream to access some bytes in the middle. With a splittable compression algorithm, it's possible to start decompressing from multiple positions in the file. There are multiple benefits, including that (1) an analytics system may be able to skip large portions of the compressed file and only read what matters, and (2) multiple workers may be able to work on the same file simultaneously, since they can each access different parts of the file without having to go over the entire thing from the beginning.

Notably, gzip is non-splittable (but since you mention Parquet specifically, keep in mind that the Parquet format may use gzip internally, and may compress multiple parts independently and just combine them into one Parquet file, leading to a structure that is splittable; in other words: read the specifics about the format you're using and check if it's splittable).
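If you do end up rewriting data with CTAS as in the sketch above, you can also pick the Parquet compression codec explicitly; parquet_compression is the CTAS table property Athena provides for this (check the current CTAS documentation if in doubt, and treat the names below as placeholders):

    -- Same CTAS pattern as before, with the compression codec spelled out
    CREATE TABLE sensor_readings_category_a_snappy
    WITH (
      format              = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location   = 's3://bucket-compacted-snappy/category-a/'
    ) AS
    SELECT sensorid, ts, value
    FROM sensor_readings
    WHERE category = 'category-a';

Because Parquet applies compression inside the file (per column chunk/page) rather than over the whole file, the resulting files stay splittable in the sense described above regardless of the codec chosen.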

  • Use columnar storage.

That is, storing data "per column" rather than "per row". This way, a single large I/O operation retrieves a lot of data for the column you need, rather than retrieving all the columns for a few records and then discarding the unnecessary ones (reading unnecessary data hurts performance tremendously).

Not only do you reduce the volume of data read from storage, you also improve how fast the CPU can process that data, since you'll have lots of pages of memory filled with useful data and the CPU has a very simple set of operations to perform; this can dramatically improve performance at the CPU level.

Also, by keeping data organized by columns, you generally achieve better compression, leading to even less I/O.

You mention Parquet, so this is taken care of. If you ever want to change formats, remember to keep using columnar storage.
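In practice this also means selecting only the columns you actually need, so the engine can skip the other column chunks entirely (again with illustrative names):

    -- Reads only the sensorid, ts and value column chunks from the Parquet files
    SELECT sensorid, ts, value
    FROM sensor_readings
    WHERE category = 'some-category';

    -- Forces every column to be read and decoded
    SELECT *
    FROM sensor_readings
    WHERE category = 'some-category';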

  • Think about the queries you need to run in order to decide on a partitioning scheme.

Like the category filter discussed above, which was present in both of the queries you gave as examples.

When you partition like in the example above, you greatly reduce I/O: the querying system will know exactly which files it needs to retrieve, and will avoid having to read the entire dataset.
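Concretely, since your keys look like s3://bucket/Category/Subcategory/file rather than Hive-style category=.../, you'd register each category's prefix as a partition yourself (the names below are hypothetical); any query that then filters on category only touches that partition's objects:

    -- Register an existing prefix as a partition (MSCK REPAIR TABLE only
    -- discovers Hive-style layouts, so it won't help here)
    ALTER TABLE sensor_readings
    ADD IF NOT EXISTS PARTITION (category = 'category-a')
    LOCATION 's3://bucket/category-a/';

    -- Athena now only reads objects under s3://bucket/category-a/
    SELECT sensorid, max(ts) AS last_ts
    FROM sensor_readings
    WHERE category = 'category-a'
    GROUP BY sensorid;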

There you go.

This is just some high-level guidance. For more specific advice, it would be necessary to know more about your dataset, but it should at least get you started asking yourself the right questions.

Upvotes: 7
