In Redshift, the queries are taking too much time to execute. Some queries keep on running or get aborted after some time. I have very limited knowledge of Redshift and it is getting difficult to understand the Query plan to optimise the query. Sharing one of the queries that we run, along with the Query Plan. The query is taking 20 seconds to execute. Query SELECT date_trunc('day', ti) as date, count(distinct deviceID) AS COUNT FROM live_events WHERE brandID = 3927 AND ti >= '2017-08-02T00:00:00+00:00' AND ti <= '2017-09-02T00:00:00+00:00' GROUP BY 1 Primary key brandID Interleaved Sort Keys we have set following columns as interleaved sort keys - brandID, ti, event_name QUERY PLAN

Redshift Query taking too much time

Reputation: 2951

I'd also like to add that in your query you are performing date transformations. Date operations are expensive in Redshift.

-- This date operation is expensive
date_trunc('day', ti) as date

If you have the luxury you should store the date in the format you need in an additional column.

Upvotes: 0

sreealekhya

Reputation: 1

Some time the issue could be due to locks being acquired by other processes. You can refer: https://aws.amazon.com/premiumsupport/knowledge-center/prevent-locks-blocking-queries-redshift/

Upvotes: 0

John Rotenstein

Reputation: 269370

You have 126 million rows in that table. It's going to take more than a second on a single dc1.large node.

Here's some ways you could improve the performance:

More nodes

Spreading data across more nodes allows more parallelization. Each node adds additional processing and storage. Even if your data volume only justifies one node, if you want more performance, add more nodes.

SORTKEY

For the right type of query, the SORTKEY can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows does not contain relevant data.

For example, your query has WHERE brandID = 3927, so having brandID as the SORTKEY would make this extremely efficient because very few disk blocks would contain data for one brand.

Interleaved sorting is rarely the best sorting method to use because it is less efficient than a single or compound sort key and takes a long time to VACUUM. If the query you have shown is typical of the type of queries you are running, then use a compound sort key of brandId, ti or ti, brandId. It will be much more efficient.

SORTKEYs are typically a date column, since they are often found in a WHERE clause and the table will be automatically sorted if data is always appended in time order.

The Interleaved Sort would be causing Redshift to read many more disk blocks to find your data, thereby significantly increasing query time.

DISTKEY

The DISTKEY should typically be set to the field that is most used in a JOIN statement on the table. This is because data relating to the same DISTKEY value is stored on the same slice. This won't have such a large impact on a single node cluster, but it is still worth getting right.

Again, you have only shown one type of query, so it is hard to recommend a DISTKEY. Based on this query alone, I would recommend DISTKEY EVEN so that all slices participate in the query. (It is also the default DISTKEY if no specific DISTKEY is selected.) Alternatively, set DISTKEY to a field not shown -- but certainly don't use brandId as the DISTKEY otherwise only one slice will participate in the query shown.

VACUUM

VACUUM your tables regularly so that the data is stored in SORTKEY order and deleted data is removed from storage.

Experiment!

Optimal settings depend upon your data and the queries you typically run. Perform some tests to compare SORTKEY and DISTKEY values and choose the settings that perform the best. Then, test again in 3 months to see if your queries or data has changed enough to make other settings more efficient.

Upvotes: 6

Redshift Query taking too much time

Answers (3)

Related Questions