rodrigocf
rodrigocf

Reputation: 2099

Long vacuums in redshift

I'm trying to run VACUUM REINDEX for some huge tables in Redshift. When I run one of those vacuums in SQLWorkbenchJ, it never finishes and returns a connection reset by peer after about 2 hours. Same thing actually happens in Python when I run the vacuums using something like this:

conn_string = "postgresql+pg8000://%s:%s@%s:%d/%s" % (db_user, db_pass, host, port, schema)
conn = sqlalchemy.engine.create_engine(conn_string,
                                       execution_options={'autocommit': True},
                                       encoding='utf-8',
                                       connect_args={"keepalives": 1, "keepalives_idle": 60,
                                                             "keepalives_interval": 60},
                                               isolation_level="AUTOCOMMIT")
conn.execute(query)

Is there a way that either using Python or SQLWorkbenchJ I can run these queries? I expect them to last at least an hour each. Is this expected behavior?

Upvotes: 1

Views: 1899

Answers (1)

cheframzi
cheframzi

Reputation: 156

Short Answer

You might need to add a mechanism in your python script to retry when the reindexing fails, based on https://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html

If a VACUUM REINDEX operation terminates before it completes, the next VACUUM resumes the reindex operation before performing the full vacuum operation.

However...

Couple of things to note (I apologize if you already know this)

So how does this answer the question?

  • If your table is using compound sorting or no sorting at all VACUUM REINDEX is not necessary at all, it brings no value
  • If your table is using interleaved, you will need to first check whether you even need to re-index?. Sample query

    SELECT tbl AS table_id,
       (col + 1) AS column_num, -- Column in this view is zero indexed
       interleaved_skew,
       last_reindex
    FROM svv_interleaved_columns
    
  • If the value of the skew is 1.0 you for sure don't need REINDEX

Bringing it all together

  • You could have your python script run the query listed in https://docs.aws.amazon.com/redshift/latest/dg/r_SVV_INTERLEAVED_COLUMNS.html to find the tables that you need to re-index (maybe you add some business logic that works better for your situation, example: your own sort skew threshold)
  • REINDEX applies the worst type of lock, so try to target the run of the script during off hours if possible
  • Challenge the need for interleaving sorting and favor compound

Upvotes: 3

Related Questions