Reputation: 11

Aws Athena different amount of data scanned but same response time

I have 2 athena queries that are exactly the same. but one selects only a day and the other selects a year.

Note: The info is in parquet format on S3 Bucket (HIVE partitioned. year/month/day/hour).

This are the benchmarks:

QUERY 1 (one day): Run time: 1 minute 5 seconds, Data scanned: 13.96 MB

QUERY 2 (one year): (Run time: 1 minute 19 seconds, Data scanned: 49.57 MB)

Why the time is almost the same, if the amount of data scanned is a lot different?

Upvotes: 1

Answers (3)

Mahdi Karabiben

Reputation: 26

Since you're using year/month/day/hour partitioning, you probably have thousands of partitions. In that case, if you don't use string data type for your partitions, Athena wastes a lot of time in the QueryPlanningTimeInMillis step mentioned by @Theo.

That happens because when you're using any data type other than string for your partition columns, Athena prunes partitions on the server side (which takes a lot of time).

To make your query much more efficient, use string data type for all of your partition columns (even if they're technically integers). That way, Athena prunes partitions at the metastore level - which in turn reduces overhead and prevents queries from timing out.

Source: Considerations and Limitations for SQL Queries in Amazon Athena (AWS Docs)

Upvotes: 1

Theo

Reputation: 132922

Check the query statistics returned by the GetQueryExecution API call (e.g. aws athena get-query-execution --query-execution-id ID --region REGION), this will break down the total execution time in these components:

QueryQueueTimeInMillis: the amount of time the query was queued before executing
QueryPlanningTimeInMillis: the amount of time it took to plan the query, including listing partitions on S3
EngineExecutionTimeInMillis: the amount of time spend running the query (N.B. this includes the QueryPlanningTimeInMillis)
ServiceProcessingTimeInMillis: the amount of time spent writing the results to S3
TotalExecutionTimeInMillis: the total amount of time spent executing the query (includes all of the above, but doesn't always match exactly)

To compare executions you need to compare the right thing. The reason for your queries taking roughly the same amount of time could be that they spent most of the time queueing, or that they spent most of the listing objects on S3 during planning, or that they spent most of the time writing the results to S3 – or it could be, like Piotr says, that all these numbers are more or less identical and Athena could use four times the capacity for the bigger query and that the few extra seconds the bigger query requires is the overhead of listing more partitions during planning, and/or writing a bigger result to S3.

Upvotes: 1

Piotr Findeisen

Reputation: 20770

Presto, which Athena is based on, is MPP -- massively parallel processing engine. If there is more data scanned, it can be scanned with higher parallelism, so you can scan more data within same wall time.

Presto UI (do not know about Athena) provides metrics like: query wall time, query CPU time and overall parallelism, which show that nicely.

Upvotes: 2

Aws Athena different amount of data scanned but same response time

Answers (3)

Related Questions