amazon-web-services amazon-athena presto aws-glue

Reputation: 616

Is AWS Glue + Athena/Hive the right choice to replace complex SQL queries?

I have been using AWS Athena to query analytics data stored on S3 across several tables. Over time I have come up with 2-3 complex SQL queries (involving several joins) for pulling the relevant data. Since Athena is meant for ad-hoc queries rather than predefined ones, and given the prohibitive cost of processing several TB and the 30-minute timeout, I am looking for alternatives.

Two alternatives that I can think of are:

1. Use a Presto-based EMR cluster and run the existing queries. This removes the 30-minute limit and might reduce costs (compared with $5/TB on Athena). The downside is that the same data is reprocessed on every run.
2. Do ETL (for example with AWS Glue) and denormalise the data. This should avoid repeating the joins, since only incremental data is processed. The flattened data can then be queried through some SQL interface such as Athena or Hive. However, I am not sure whether denormalisation is a good idea, given the cost of storing (huge amounts of) redundant data.

Which of these is the better choice, or is there a better standard technique for this problem?
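
To make the trade-off between rescanning everything and processing only increments concrete, here is a rough arithmetic sketch. It assumes a flat per-TB-scanned price (the $5/TB figure above); the dataset size, increment size, and run count are hypothetical, and the model ignores storage, cluster, and ETL costs as well as dataset growth in the full-rescan case.

```python
# Rough scan-cost comparison. All sizes and run counts are hypothetical.
PRICE_PER_TB_USD = 5.0  # assumed price per TB scanned, from the $5/TB figure in the question


def full_rescan_cost(dataset_tb: float, runs: int,
                     price: float = PRICE_PER_TB_USD) -> float:
    """Cost when every run scans the whole (fixed-size) dataset again."""
    return dataset_tb * max(runs, 0) * price


def incremental_cost(dataset_tb: float, increment_tb: float, runs: int,
                     price: float = PRICE_PER_TB_USD) -> float:
    """Cost when the first run scans everything and each later run scans only new data."""
    if runs <= 0:
        return 0.0
    return (dataset_tb + increment_tb * (runs - 1)) * price


# Example: 4 TB of history, 0.25 TB of new data per run, 30 runs.
rescan = full_rescan_cost(4.0, 30)
incremental = incremental_cost(4.0, 0.25, 30)
print(f"full rescans: ${rescan:.2f}, incremental: ${incremental:.2f}")
```

Under these assumptions the gap grows linearly with the number of runs, which is the main argument for option 2 when the same queries are run repeatedly.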

Upvotes: 4

Answers (2)

Kamil Bajda-Pawlikowski

Reputation: 602

I think it's best to do 2 (denormalization) and then 1 (run Presto over the optimized data layout). Also, Presto with Cost-Based Optimizer might be worth a look: https://www.starburstdata.com/technical-blog/starburst-presto-on-aws-18x-faster-than-emr/
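
As a minimal sketch of the "denormalize first, then query the flat data" idea, the example below uses Python's built-in `sqlite3` as a stand-in for the real engine (Glue for the ETL step, Presto or Athena for the queries). The tables `events`, `users`, `products`, and `events_flat` and the helper `load_partition` are hypothetical names chosen for illustration, not anything from the question.

```python
import sqlite3

# In-memory database standing in for the data lake; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id INTEGER, user_id INTEGER, product_id INTEGER, dt TEXT);
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE events_flat (event_id INTEGER, dt TEXT, country TEXT, category TEXT);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "DE"), (2, "US")])
conn.executemany("INSERT INTO products VALUES (?, ?)", [(10, "books"), (11, "games")])
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    (100, 1, 10, "2020-01-01"),
    (101, 2, 11, "2020-01-01"),
    (102, 1, 11, "2020-01-02"),
])


def load_partition(conn: sqlite3.Connection, dt: str) -> int:
    """Join and store the rows of one date partition; return the rows stored for it."""
    # Delete first so that re-running the same day does not duplicate rows.
    conn.execute("DELETE FROM events_flat WHERE dt = ?", (dt,))
    conn.execute("""
        INSERT INTO events_flat (event_id, dt, country, category)
        SELECT e.event_id, e.dt, u.country, p.category
        FROM events e
        JOIN users u ON u.user_id = e.user_id
        JOIN products p ON p.product_id = e.product_id
        WHERE e.dt = ?
    """, (dt,))
    return conn.execute(
        "SELECT COUNT(*) FROM events_flat WHERE dt = ?", (dt,)
    ).fetchone()[0]


# Each run pays for the joins on the new partition only.
written_day1 = load_partition(conn, "2020-01-01")
written_day2 = load_partition(conn, "2020-01-02")

# Downstream queries read the flat table and need no joins.
by_country = conn.execute(
    "SELECT country, COUNT(*) FROM events_flat GROUP BY country ORDER BY country"
).fetchall()
print(by_country)
```

The delete-then-insert step per date keeps the load idempotent, so a failed or repeated run for one day can simply be re-executed. On S3 the analogous move would typically be overwriting a single date partition, though the exact mechanism depends on the tooling.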

Upvotes: 2

Kishore Bharathy

Reputation: 441

Whether to denormalize the data depends on your use case, but denormalization is generally preferred for S3/HDFS storage. You can follow this link for advice on storing data for Athena and improving query performance:

https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

Upvotes: 0
