Reputation: 924
Let me ask about what can be approptiate distkey(dist style) and sort key for master tables in Redshift.
We have several master tables with differnet size, purpose and cardinalities and now we're verify what SORTKEY and DISTKEY combination would be more appropriate.
For example, we have two main key in our customer master table, frequent use in our BI queries and join key for other tables, as follows;
Ex. MST_CUSTOMER (about 7 millions, 850MB in SQL Server)
CUSTOMER_ID <-- Primary Key, Unique and High Cardinality
CUSTOMER_CATEGORY <-- Low Cardinality (like VIP, BZ) but not always uses in BI query
CUST_NAME
CUST_ADDRESS...
First, if we can use both columns for SORTKEY, which is the best order in this case?
1.Should Low cardinality comes first (CUST_CATEGORY, CUST_ID)
2.Hifh frequency comes first (CUST_ID, CUST_CATEGORY)
3.should chose only single column eigher of CUST_ID or CUST_CATEGORY
And generally, may I ask what is the best combination in such case?
1.DISTSTYLE KEY & COMPOUND SORTKEY
2.DISTSTYLE KEY & SORTKEY
3.DISTSTYLE ALL & INTERLEAVED SORTKEY...
I imagine if we would find 'DS_BCAST_INNER' in query plan frequently, we'd better to consider 'DISTSTYLE ALL'. But still not sure what can be best practice for master tables.
Any suggestion would be appreciated again.
Best Regards
Upvotes: 2
Views: 2731
Reputation: 210
If you join often by customer_id then make that the sort key. Customer_category would make a good secondary sort key. The data should be distributed together with data that it will be joined to. If you have fact tables distributed by customer Id then make that the distribution key. It also depends on how many new customers you get. If sales are heavily skewed towards new customers then your data distribution will be skewed and customer_id would not be a good distribution key.
Upvotes: 2