Debo
Debo

Reputation: 41

Redshift sortkey and distkey design when there are multiple columns used for querying data

My question is let's suppose I have a table A with 216 columns :

CREATE TABLE test_data
(
    fund_name     varchar(255)
    fund_desc     varchar(255)
    accountId     varchar(255)   
    postdate      date,
    effectiveDate date,
    businessdate  Date,
    activeTill    TimeStamp
) 

Our use case is as follows:

We have below filter criteria for our data distribution:

activeTill will always have value as "9999-09-99,23:09:0' for all active record; accountId count can max go up to 12k distinct values.

Query #1:

select * 
from fund_data 
where postdate between '2024-06-01' and '2024-06-20' 
  and accountId in ('1234', '56789', '0009') 
  and activeTill > sysdate

Query #2:

select * 
from fund_data 
where effectiveDate between '2024-06-01' and '2024-06-20' 
  and accountId in ('1234', '56789', '0009') 
  and activeTill > sysdate

Query #3:

select * 
from fund_data 
where businessdate between '2024-06-01' and '2024-06-20' 
  and accountId in ('1234', '56789', '0009') 
  and activeTill > sysdate
  1. In our current set up, distkey is AUTO and SORT_KEY as (postdate, accountId, sysdate).

  2. The query performance was ok when only postDate was getting used in where clause as it is the 1st sort key.

  3. Now since consumers started using other date fields equally, those queries performing worst.

  4. So, looking for recommendation ,in such cases what can be ideal SORT key and DIST key.

Upvotes: 0

Views: 256

Answers (1)

Bill Weiner
Bill Weiner

Reputation: 11082

This is a common optimization change that people have to go through. How to balance the optimization gains for multiple sort keys. I did a re:Invent talk that included this a few years back which may add some more context - https://www.youtube.com/watch?v=bxfnWTiY7EM&t

This all comes down to optimizing the use of your block metadata. In your example when you sort by postdate first this will use all the resolution of this column before any other column's data is used to order rows. This means that values for postdate are lumped together resulting in metadata values for this column that have small ranges of values and therefore only a few blocks of data need to be read when querying based on this column in the WHERE clause.

However, the values for the other date columns, effectiveDate and businessdate, are going to be scattered all over the table. So when these columns are in the WHERE clause the entire table's data needs to be read from disk. This is likely what is causing a significant portion of the slowdown you are experiencing for these queries. Check the scan steps for all 3 types of queries and note the number of rows of data read from disk for each. This will likely confirm this theory.

So how do you "balance" the value of metadata for all 3 date columns? You need to add columns that reduce the resolution for the first 2 sort keys. These "synthetic" columns don't replace the date columns used in queries but are only used for sorting the table. (watch the video if this isn't clear) Also it is usually a good idea to identify these synthetic columns by giving them some identification in the name - like starting these with double underscore.

I propose you add 2 columns to your table - __postdate_sortkey and __effectiveDate_sortkey and use these as the first and second sortkeys of the table. The third sortkey should be businessdate so that those queries will receive benefit too. A good starting point for the definition of __postdate_sortkey could be date_trunc('year', postdate) and for __effectiveDate_sortkey, date_trunc('month', effectiveDate). Of course you will need to profile your performance to make sure this amount of coarseness is correct and this will need to be done when the table is vacuumed (resorted) and analyzed (metadata updated).

Again you don't change the query WHERE clauses. This is just setting a new sort order for the table where the metadata for the actual date columns provides the needed optimization for all queries. Note that this "balancing" of query performance will likely impact postdate queries slightly but that the overall solution performance will be greatly improved.

There are ways to inspect the metadata directly to better fine tune things but that is likely too involved for a quick answer here. Hopefully this level of optimization isn't needed.

Hope this helps.

Upvotes: 2

Related Questions