senior_citizen_
senior_citizen_

Reputation: 147

Why not to use timestamp with Interleaved Sortkey?

I'm trying to figure out the different types of sortkeys in Amazon Redshift and I encountered a strange warning here, which is not explained:

Important: Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.

And yet, in their own example, Amazon uses interleaved key on a date column with good performance.

So, my question is - what's the explanation to this warning and should I take it seriously? More precisely - is there a problem with using interleaved key over a timestamp column?

Upvotes: 11

Views: 2049

Answers (4)

Timestamp column may go to hours, minutes, seconds and milliseconds which is costly to sort the data. A data with milliseconds granularity is like having too much degree of zone maps to keep a record where the data starts and ends within dataset. Same is not true with date column in sort key. Date column has less degress of zone maps to keep a track of data residing in the table.

Upvotes: 0

Nathan Griffiths
Nathan Griffiths

Reputation: 12756

From https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html

As you add rows to a sorted table that already contains data, the unsorted region grows, which has a significant effect on performance. The effect is greater when the table uses interleaved sorting, especially when the sort columns include data that increases monotonically, such as date or timestamp columns.

The key point in the original quote is not that that data is a date or timestamp, it's that it increases "monotonically", which in this context presumably means increasing sequentially such as an event timestamp or an Id number.

Upvotes: 6

Fact
Fact

Reputation: 2450

The date(not timestamp) column as a interleaved sort key makes sense when you know in an average X number of rows are processed everyday and you are going to filter based on it, if you are not going to use it then leave it out. Also a note on vacuum - when the VACUUM process is in progress, it needs temporary space to be able to complete the task by sorting and then merging the data in chunks. Cancelling the VACUUM process mid flight will cause extra spaces to be not reclaimed so if for some reason any Vacuum has ever been cancelled in your cluster this can be accounted to the space increase. See the link https://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html#r_VACUUM_usage_notes and point 3 the last point is of particular interest. In my case the tables ended up growing very rapidly compared to the amount of rows inserted and had to build an auto table creation using deep copy

Upvotes: 0

Łukasz Kamiński
Łukasz Kamiński

Reputation: 5930

I think it might have been explained later on when they describe issues around vacuuming/reindexing:

When tables are initially loaded, Amazon Redshift analyzes the distribution of the values in the sort key columns and uses that information for optimal interleaving of the sort key columns. As a table grows, the distribution of the values in the sort key columns can change, or skew, especially with date or timestamp columns. If the skew becomes too large, performance might be affected.

So if that is the only reason, then it just means you will have increased maintenance on index.

Upvotes: 5

Related Questions