Reputation: 1767
I have a requirement where Redshift wouldn't just serve as a semantic layer for the front end, but would also be used for inserts and updates on tables.
Doubts:
1) The front end is a simple framework that fetches a table and shows it in the UI with pagination. As of now we're doing select * from table and it takes around 10 seconds to fetch around 3,000 rows. Can it be made faster?
2) This is a pretty new use-case for me and I'm trying to figure out which distribution style would be best in this scenario. The data is very small, only tens of thousands of rows. I'm using DISTSTYLE ALL, since the documentation suggests ALL for any table with fewer than 1 million rows.
3) For inserts/updates we need a unique column, so we're creating a custom IDENTITY(1,1) column on the table and making it the sort key, because every update locates the unique row in the DB; inserts simply get the next incremental value. Is this the right way, or are there more sophisticated ways to address this problem?
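A minimal sketch of the identity-column setup described above (table and column names are hypothetical):

```sql
-- Hypothetical table: an auto-incrementing id used as the unique lookup key.
-- IDENTITY(1,1) starts at 1 and steps by 1, but note that Redshift does not
-- guarantee the generated values are gap-free or strictly sequential.
CREATE TABLE items (
    item_id    BIGINT IDENTITY(1,1),
    item_name  VARCHAR(100),
    updated_at TIMESTAMP
)
DISTSTYLE ALL
SORTKEY (item_id);

-- Updates then locate a single row by its identity value:
UPDATE items SET item_name = 'new name' WHERE item_id = 42;
```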
4) Any other suggestion is most welcome.
Upvotes: 0
Views: 357
Reputation: 270039
A data warehouse like Amazon Redshift is quite poor at doing INSERT and UPDATE operations.
The reason is that whenever a row is modified (UPDATE), the current row is marked as deleted and a new row is appended at the end of the storage space. This applies even if only one value in one column is modified, because data is compressed within storage blocks and you can't modify compressed data without rewriting the whole block.
When data is added with INSERT, the new rows are added at the end of the storage area for each column. (Being a columnar database, each column is stored separately.) This means that the unsorted region grows whenever data is appended, making it less efficient to find data within a table. This can be remedied by running a VACUUM, which will re-sort the rows.
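For example (table name hypothetical), re-sorting and reclaiming deleted rows after a batch of updates:

```sql
-- Reclaims space from deleted rows and re-sorts the unsorted region.
VACUUM FULL items;

-- Refresh the planner's statistics afterwards so queries use good plans.
ANALYZE items;
```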
Amazon Redshift is not a good choice for use as a standard OLTP database. Rather, it is best for loading lots of information from existing data sources and running complex queries across millions and billions of rows.
You might be better off doing such updates in a normal database, then extracting the data into Redshift for reporting ("read-only") purposes.
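One common pattern, sketched here with hypothetical names: do the transactional work in an OLTP database (e.g. PostgreSQL), export the result to Amazon S3, and load it into Redshift with COPY:

```sql
-- Load a CSV extract from S3 into Redshift (bucket, table, and IAM role
-- names are hypothetical).
COPY items
FROM 's3://my-bucket/exports/items.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;
```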
As to DISTKEY/SORTKEY, the general rule is:
- Set the DISTKEY to the column most commonly used in a JOIN, because it co-locates the data from both tables onto the same slice.
- Set the SORTKEY to the column most commonly used in a WHERE clause, because it allows Redshift to "skip over" disk blocks that do not contain matching rows.
Upvotes: 2
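Applied to a hypothetical table joined on customer_id and filtered by order_date, those rules would look like:

```sql
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE
)
DISTKEY (customer_id)   -- most common JOIN column: co-locates matching rows
SORTKEY (order_date);   -- most common WHERE column: enables block skipping
```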