LuisHerrera

Reputation: 11

Amazon Redshift, getting slower in every update run

I am just getting started with Amazon Redshift.

I have just loaded a big table: millions of rows and 171 fields. The data quality is poor; there are a lot of characters that must be removed.

I have prepared an UPDATE for every column; since Redshift stores data in columnar format, I assumed it would be faster to clean the data column by column.

UPDATE MyTable SET Field1 = REPLACE(Field1, '~', '');

UPDATE MyTable SET Field2 = REPLACE(Field2, '~', '');

. . .

UPDATE MyTable SET FieldN = REPLACE(FieldN, '~', '');

The first UPDATE took 1 min. The second one took 1 min 40 sec...

Every time I run one of the updates, it takes longer than the previous one. I have run 19 so far, and the last one took almost 25 minutes. The time consumed by each UPDATE keeps increasing.

Another thing: during the first update, CPU utilization was minimal; by the last update it was at 100%.

I have a 3-node cluster of dc1.large instances.

I have rebooted the cluster but the problem continues.

I need some guidance to find the cause of this problem.

Upvotes: 0

Views: 1873

Answers (2)

avertocle

Reputation: 621

A better way might be:

  1. Create another identical table.
  2. Read N (say 10,000) rows at a time from the first table, process them, and load them into the second table using S3 loading (instead of INSERT).
  3. Delete the first table and rename the second.
  4. If you are running into space issues, delete the N migrated rows from the first table after every iteration and run vacuum delete only <name_of_first_table>
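The table-swap part of the steps above could be sketched in SQL like this (a sketch only; the table name, bucket, and credentials are placeholders, and step 2's cleaning is assumed to have happened before the cleaned files landed in S3):

```sql
-- 1. Create an identical, empty table with the same column definitions
CREATE TABLE MyTable_clean (LIKE MyTable);

-- 2. Bulk-load the cleaned rows from S3 with COPY instead of row-by-row INSERTs
COPY MyTable_clean
FROM 's3://<your-bucket-name>/load/key_prefix'
CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>';

-- 3. Swap the tables
DROP TABLE MyTable;
ALTER TABLE MyTable_clean RENAME TO MyTable;
```

Because COPY writes fresh blocks rather than marking old rows deleted the way UPDATE does, the new table carries none of the dead-row overhead that was slowing the updates down.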

References

  1. S3 loading: http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html

copy table from 's3://<your-bucket-name>/load/key_prefix' credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' options;

Upvotes: 0

Jon Ekiz

Reputation: 1022

When you update a column, Redshift actually marks those rows as deleted and inserts new rows with the new value. So there is a lot of space that needs to be reclaimed, and you need to VACUUM your table after the updates.

AWS also recommends that you run ANALYZE after a large number of updates to refresh the statistics used by the query planner.
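Using the table name from the question, the two maintenance commands would look like this (both are standard Redshift commands):

```sql
-- Reclaim the space held by deleted row versions and re-sort the table
VACUUM MyTable;

-- Refresh the table statistics used by the query planner
ANALYZE MyTable;
```

Running these between updates should stop each successive UPDATE from scanning an ever-growing pile of dead rows.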

http://docs.aws.amazon.com/redshift/latest/dg/r_UPDATE.html

Upvotes: 1
