Long vacuums in redshift

Question

I'm trying to run VACUUM REINDEX for some huge tables in Redshift. When I run one of those vacuums in SQLWorkbenchJ, it never finishes and returns a connection reset by peer after about 2 hours. Same thing actually happens in Python when I run the vacuums using something like this:

conn_string = "postgresql+pg8000://%s:%s@%s:%d/%s" % (db_user, db_pass, host, port, schema)
conn = sqlalchemy.engine.create_engine(conn_string,
                                       execution_options={'autocommit': True},
                                       encoding='utf-8',
                                       connect_args={"keepalives": 1, "keepalives_idle": 60,
                                                             "keepalives_interval": 60},
                                               isolation_level="AUTOCOMMIT")
conn.execute(query)

Is there a way that either using Python or SQLWorkbenchJ I can run these queries? I expect them to last at least an hour each. Is this expected behavior?

cheframzi · Accepted Answer

Short Answer

You might need to add a mechanism in your python script to retry when the reindexing fails, based on https://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html

If a VACUUM REINDEX operation terminates before it completes, the next VACUUM resumes the reindex operation before performing the full vacuum operation.

However...

Couple of things to note (I apologize if you already know this)

Tables in redshift can have N sort keys (columns that data are sorted by) and Redshift supports only 2 sorting styles
- Compound: You are really sorting based on the first sort column and then on the second, ...
- Interleaved: The table will be sorted on all sort columns (https://en.wikipedia.org/wiki/Z-order_curve), some people would choose this style when they are not sure how the table will be used. However, it comes with a lot of issues on its own (More solid documentation here https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-compound-and-interleaved-sort-keys/ where compound sorting is generally favored)

So how does this answer the question?

If your table is using compound sorting or no sorting at all VACUUM REINDEX is not necessary at all, it brings no value

If your table is using interleaved, you will need to first check whether you even need to re-index?. Sample query

SELECT tbl AS table_id,
   (col + 1) AS column_num, -- Column in this view is zero indexed
   interleaved_skew,
   last_reindex
FROM svv_interleaved_columns

If the value of the skew is 1.0 you for sure don't need REINDEX

Bringing it all together

You could have your python script run the query listed in https://docs.aws.amazon.com/redshift/latest/dg/r_SVV_INTERLEAVED_COLUMNS.html to find the tables that you need to re-index (maybe you add some business logic that works better for your situation, example: your own sort skew threshold)
REINDEX applies the worst type of lock, so try to target the run of the script during off hours if possible
Challenge the need for interleaving sorting and favor compound

Long vacuums in redshift

Answers (1)

Short Answer

However...

So how does this answer the question?

Bringing it all together

Related Questions