Reputation: 41
My question is let's suppose I have a table A with 216 columns :
CREATE TABLE test_data
(
fund_name varchar(255)
fund_desc varchar(255)
accountId varchar(255)
postdate date,
effectiveDate date,
businessdate Date,
activeTill TimeStamp
)
Our use case is as follows:
Frequent updates to the table as data is getting loaded based on kafka events through out the day.
Historical Data stores .Hence billions of rows and data keep getting added every day.
10 node cluster.
We have below filter criteria for our data distribution:
activeTill
will always have value as "9999-09-99,23:09:0' for all active record; accountId
count can max go up to 12k distinct values.
Query #1:
select *
from fund_data
where postdate between '2024-06-01' and '2024-06-20'
and accountId in ('1234', '56789', '0009')
and activeTill > sysdate
Query #2:
select *
from fund_data
where effectiveDate between '2024-06-01' and '2024-06-20'
and accountId in ('1234', '56789', '0009')
and activeTill > sysdate
Query #3:
select *
from fund_data
where businessdate between '2024-06-01' and '2024-06-20'
and accountId in ('1234', '56789', '0009')
and activeTill > sysdate
In our current set up, distkey
is AUTO and SORT_KEY
as (postdate, accountId, sysdate).
The query performance was ok when only postDate
was getting used in where clause as it is the 1st sort key.
Now since consumers started using other date fields equally, those queries performing worst.
So, looking for recommendation ,in such cases what can be ideal SORT key and DIST key.
Upvotes: 0
Views: 256
Reputation: 11082
This is a common optimization change that people have to go through. How to balance the optimization gains for multiple sort keys. I did a re:Invent talk that included this a few years back which may add some more context - https://www.youtube.com/watch?v=bxfnWTiY7EM&t
This all comes down to optimizing the use of your block metadata. In your example when you sort by postdate first this will use all the resolution of this column before any other column's data is used to order rows. This means that values for postdate are lumped together resulting in metadata values for this column that have small ranges of values and therefore only a few blocks of data need to be read when querying based on this column in the WHERE clause.
However, the values for the other date columns, effectiveDate and businessdate, are going to be scattered all over the table. So when these columns are in the WHERE clause the entire table's data needs to be read from disk. This is likely what is causing a significant portion of the slowdown you are experiencing for these queries. Check the scan steps for all 3 types of queries and note the number of rows of data read from disk for each. This will likely confirm this theory.
So how do you "balance" the value of metadata for all 3 date columns? You need to add columns that reduce the resolution for the first 2 sort keys. These "synthetic" columns don't replace the date columns used in queries but are only used for sorting the table. (watch the video if this isn't clear) Also it is usually a good idea to identify these synthetic columns by giving them some identification in the name - like starting these with double underscore.
I propose you add 2 columns to your table - __postdate_sortkey and __effectiveDate_sortkey and use these as the first and second sortkeys of the table. The third sortkey should be businessdate so that those queries will receive benefit too. A good starting point for the definition of __postdate_sortkey could be date_trunc('year', postdate) and for __effectiveDate_sortkey, date_trunc('month', effectiveDate). Of course you will need to profile your performance to make sure this amount of coarseness is correct and this will need to be done when the table is vacuumed (resorted) and analyzed (metadata updated).
Again you don't change the query WHERE clauses. This is just setting a new sort order for the table where the metadata for the actual date columns provides the needed optimization for all queries. Note that this "balancing" of query performance will likely impact postdate queries slightly but that the overall solution performance will be greatly improved.
There are ways to inspect the metadata directly to better fine tune things but that is likely too involved for a quick answer here. Hopefully this level of optimization isn't needed.
Hope this helps.
Upvotes: 2