MatBailie

Reputation: 86706

VACUUM on Redshift (AWS) after DELETE and INSERT

I have a table as below (simplified example, we have over 60 fields):

CREATE TABLE "fact_table" (
  "pk_a" bigint                 NOT NULL ENCODE lzo,
  "pk_b" bigint                 NOT NULL ENCODE delta,
  "d_1"  bigint                 NOT NULL ENCODE runlength,
  "d_2"  bigint                 NOT NULL ENCODE lzo,
  "d_3"  character varying(255) NOT NULL ENCODE lzo,
  "f_1"  bigint                 NOT NULL ENCODE bytedict,
  "f_2"  bigint                     NULL ENCODE delta32k
)
DISTSTYLE KEY
DISTKEY ( d_1 )
SORTKEY ( pk_a, pk_b );

The table is distributed by a high-cardinality dimension.

The table is sorted by a pair of fields that increment in time order.

The table contains over 2 billion rows and uses ~350 GB of disk space, both figures per node.


Our hourly house-keeping involves updating some recent records (within the last 0.1% of the table, based on the sort order) and inserting another 100k rows.
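For context, the UPDATE-based version of that job looks roughly like the sketch below; staging_table, the SET list and the pk_a > X boundary are placeholders standing in for the real columns and predicate, not the actual code:

-- Hypothetical sketch of the hourly job: update recent rows, then append ~100k new ones
UPDATE fact_table
SET    f_1 = s.f_1, f_2 = s.f_2
FROM   staging_table s
WHERE  fact_table.pk_a = s.pk_a
AND    fact_table.pk_b = s.pk_b
AND    fact_table.pk_a > X;   -- only the most recent ~0.1% of the sort order

INSERT INTO fact_table
SELECT <blah> FROM <query> WHERE pk_a > X
ORDER BY pk_a, pk_b;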

Whatever mechanism we choose, VACUUMing the table becomes overly burdensome:
- The sort step takes seconds
- The merge step takes over 6 hours

We can see from SELECT * FROM svv_vacuum_progress; that all 2 billion rows are being merged, even though the first 99.9% are completely unaffected.
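For reference, a minimal way to watch the merge; the SVV_VACUUM_PROGRESS query is the one quoted above, and the STL_VACUUM check is an assumption about an equivalent after-the-fact view of rows touched per run:

SELECT * FROM svv_vacuum_progress;          -- live status of the running vacuum

SELECT table_id, status, rows, sortedrows, eventtime
FROM   stl_vacuum                           -- history of recent vacuum runs
ORDER  BY eventtime DESC
LIMIT  20;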


Our understanding was that the merge should only affect:
1. Deleted records
2. Inserted records
3. All the records from the earliest row affected by (1) or (2) through to the end of the table


We have tried DELETE and INSERT rather than UPDATE, and that DML step is now significantly quicker, but the VACUUM still merges all 2 billion rows.

DELETE FROM fact_table WHERE pk_a > X;
-- 42 seconds

INSERT INTO fact_table SELECT <blah> FROM <query> WHERE pk_a > X ORDER BY pk_a, pk_b;
-- 90 seconds

VACUUM fact_table;
-- 23645 seconds

In fact, the VACUUM merges all 2 billion records even if we just trim the last 746 rows off the end of the table.


The Question

Does anyone have any advice on how to avoid this immense VACUUM overhead, and only MERGE on the last 0.1% of the table?

Upvotes: 8

Views: 7000

Answers (2)

androboy

Reputation: 835

I couldn't fit this in the comments section, so I'm posting it as an answer.

If the sort keys are the same across the time-series tables, you already have a UNION ALL view over them as the time-series view, and performance is still bad, then you may want to restructure the view with explicit filters on each member table:

create or replace view schemaname.table_name as
select * from table_20140901 where sort_key_date = '2014-09-01' union all
select * from table_20140902 where sort_key_date = '2014-09-02' union all
...
select * from table_20140925 where sort_key_date = '2014-09-25';

Also make sure to collect statistics (ANALYZE) on the sort keys of all these tables after every load, then try running your queries against the view. Redshift should be able to push any filter predicates you use down into the member tables. At the end of the day, after the load, just run a VACUUM SORT ONLY or a full VACUUM on the current day's table, which should be much faster.
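A rough sketch of that end-of-day step, reusing the example table and column names from the view above (the ANALYZE column list is optional):

analyze table_20140925 (sort_key_date);   -- refresh stats on the sort key after the load

vacuum sort only table_20140925;          -- re-sort just today's table, no full-history merge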

Let me know if you are still facing any issues after the above test.

Upvotes: 0

Joe Harris

Reputation: 14035

How often are you VACUUMing the table? How does the long duration affect you? Our load processing continues to run during VACUUM and we've never experienced any performance problems with doing that. Basically it doesn't matter how long it takes, because we just keep running BAU.

I've also found that we don't need to VACUUM our big tables very often. Once a week is more than enough. Your use case may be very performance sensitive but we find the query times to be within normal variations until the table is more than, say, 90% unsorted.
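If it helps, you can check how unsorted a table is before deciding whether a vacuum is worth it; something along these lines, using SVV_TABLE_INFO (exact columns may vary by Redshift version):

SELECT "table", unsorted, tbl_rows
FROM   svv_table_info
WHERE  "table" = 'fact_table';   -- unsorted is a percentage of the table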

If you find that there's a meaningful performance difference, have you considered using recent and history tables (inside a UNION view if needed)? That way you can VACUUM the small "recent" table quickly.
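A minimal sketch of that layout, assuming hypothetical table names (fact_recent, fact_history) and the schema from the question; X stands for whatever boundary separates recent rows from history:

-- Both tables share the fact table's structure (LIKE preserves dist/sort keys)
CREATE TABLE fact_recent  (LIKE fact_table);
CREATE TABLE fact_history (LIKE fact_table);

CREATE OR REPLACE VIEW fact_all AS
SELECT * FROM fact_history
UNION ALL
SELECT * FROM fact_recent;

-- Hourly housekeeping touches only fact_recent, so this stays small and fast
VACUUM fact_recent;

-- Periodically roll older rows into history, then vacuum both
INSERT INTO fact_history SELECT * FROM fact_recent WHERE pk_a <= X;
DELETE FROM fact_recent WHERE pk_a <= X;
VACUUM fact_recent;
VACUUM fact_history;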

Upvotes: 1
