Sachiko
Sachiko

Reputation: 924

Redshift : What can be appropriate combination of DIST KEY and SORT KEY for Master Tables?

Let me ask about what can be approptiate distkey(dist style) and sort key for master tables in Redshift.

We have several master tables with differnet size, purpose and cardinalities and now we're verify what SORTKEY and DISTKEY combination would be more appropriate.

For example, we have two main key in our customer master table, frequent use in our BI queries and join key for other tables, as follows;

Ex. MST_CUSTOMER (about 7 millions, 850MB in SQL Server)

CUSTOMER_ID     <-- Primary Key, Unique and High Cardinality
CUSTOMER_CATEGORY   <-- Low Cardinality (like VIP, BZ) but not always uses in BI query
CUST_NAME
CUST_ADDRESS...

First, if we can use both columns for SORTKEY, which is the best order in this case?

1.Should Low cardinality comes first (CUST_CATEGORY, CUST_ID) 
2.Hifh frequency comes first (CUST_ID, CUST_CATEGORY)
3.should chose only single column eigher of CUST_ID or CUST_CATEGORY

And generally, may I ask what is the best combination in such case?

1.DISTSTYLE KEY & COMPOUND SORTKEY
2.DISTSTYLE KEY & SORTKEY
3.DISTSTYLE ALL & INTERLEAVED SORTKEY...

I imagine if we would find 'DS_BCAST_INNER' in query plan frequently, we'd better to consider 'DISTSTYLE ALL'. But still not sure what can be best practice for master tables.

Any suggestion would be appreciated again.

Best Regards

Upvotes: 2

Views: 2731

Answers (1)

Nate
Nate

Reputation: 210

If you join often by customer_id then make that the sort key. Customer_category would make a good secondary sort key. The data should be distributed together with data that it will be joined to. If you have fact tables distributed by customer Id then make that the distribution key. It also depends on how many new customers you get. If sales are heavily skewed towards new customers then your data distribution will be skewed and customer_id would not be a good distribution key.

Upvotes: 2

Related Questions