Reputation: 11
I am just getting started with Amazon Redshift.
I have just loaded a big table with millions of rows and 171 fields. The data quality is poor; there are a lot of characters that must be removed.
I have prepared an UPDATE for every column; since Redshift stores data in columnar format, I assumed it would be faster to clean it column by column.
UPDATE MyTable SET Field1 = REPLACE(Field1, '~', '');
UPDATE MyTable SET Field2 = REPLACE(Field2, '~', '');
. . .
UPDATE MyTable SET FieldN = REPLACE(FieldN, '~', '');
The first UPDATE took 1 min. The second one took 1 min 40 sec...
Every time I run one of the updates, it takes more time than the last one. I have run 19 so far and the last one took almost 25 min; the time consumed by each UPDATE keeps increasing.
Another thing: with the first update the CPU utilization was minimal, but with the last one it reached 100%.
I have a 3-node cluster of dc1.large instances.
I have rebooted the cluster, but the problem continues.
Please, I need some guidance to find the cause of this problem.
Upvotes: 0
Views: 1873
Reputation: 621
A more optimal way might be:
vacuum delete only <name_of_first_table>
References:
copy table from 's3://<your-bucket-name>/load/key_prefix' credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' options;
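Filled in with the table name from the question (and leaving the COPY placeholders as they are), the delete-only vacuum would look something like this sketch:

-- DELETE ONLY reclaims the space of rows marked as deleted without
-- re-sorting the table, so it is usually faster than a full VACUUM.
VACUUM DELETE ONLY MyTable;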
Upvotes: 0
Reputation: 1022
When you update a column, Redshift actually marks the old rows as deleted and inserts new rows with the new value, so there is a lot of space that needs to be reclaimed. That is why you need to VACUUM your table after the updates.
AWS also recommends running ANALYZE after such updates to refresh the statistics for the query planner.
http://docs.aws.amazon.com/redshift/latest/dg/r_UPDATE.html
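A minimal sketch of that advice applied to the table from the question (column names beyond the first two are assumed, and the consolidated UPDATE is an extrapolation from the delete-and-insert behaviour described above rather than something the answer states):

-- One UPDATE touching several columns rewrites the affected rows once,
-- instead of once per column as in the original per-field statements.
UPDATE MyTable
SET Field1 = REPLACE(Field1, '~', ''),
    Field2 = REPLACE(Field2, '~', '');  -- extend the SET list for the remaining columns

-- Reclaim the space held by the old, deleted row versions and re-sort the table.
VACUUM MyTable;

-- Refresh the statistics the query planner relies on.
ANALYZE MyTable;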
Upvotes: 1