RedShift - Why you shouldn't compress the sortykey column?

Question

I know may experts suggest this, even I follow this as best practice(Read it from AWS Blog), there is a very deep doc about this in Github, but still I'm confused with this term. It'll affect the range-restricted scan and not able to understand this concept.

Can someone give me an example, that clarifies why we shouldn't use the compression on the sort key column?

Bill Weiner · Accepted Answer

So the reality is simple executable answers are often not perfect but the best rule of thumb. You say you have read the docs so I won't go into detail. The assumption behind this recommendation is that the sort key is also the common where clause in many queries. This is important to make sense of the recommendation but it is generally true. I have lots of queries with "where date_col > getdate() - interval '1 year'" from which you decide to make the sort key of the table "date_col" - very typical.

Now when you run this type of query Redshift leader node will check the where condition against the block meta data for the date_col column. Whichever blocks have the desired dates within them then these block "match". Now you are going to look at the data for other columns as well. To get the needed blocks for these columns Redshift uses another piece of meta data for the date_col column - namely the row number range that are in each matching block. These row number ranges are used to find the blocks for other columns based on the metadata for those columns. I hope this makes sense - find the blocks that match the where clause then find the blocks in other columns. All of these to not read blocks that aren't needed for the query.

Now for the example - if you have a table with 2 columns: 1) sort key column is an INT and 2) a large varchar. Both are compressed. Now the first column (INT) is in sort order and will be highly compressed. Let's say that this column fits in 1 block. The other column (large varchar) takes 10 blocks. We run our query with a where clause on the INT column, it matches the 1 block, but not the row numbers needed in the other column results in getting all 10 blocks. No savings in disk read bandwidth. But if the INT column is not compressed it will take up more blocks - let's say 8 blocks. The same query will match only one of the 8 blocks of the INT column and the row number cross reference to the varchar column may match only 3 of the 10 block for that column. Now we have reduced the data read from disk.

Hopefully that makes sense. You can see that there are a number of assumptions behind this recommendation which are true more often than not. Without these assumptions it is hard to figure out whys they say this. Namely that your sort key is your common where clause, that the compression of the sortkey column will be much better than other columns, and that the data stored in the sortkey is smaller than data in other columns. And a few others but less central.

Did this help?

RedShift - Why you shouldn't compress the sortykey column?

Answers (1)

Related Questions

RedShift - Why you shouldn&#39;t compress the sortykey column?

Answers (1)

Related Questions

RedShift - Why you shouldn't compress the sortykey column?