Reputation: 11335
I have a large number of tables that are "INSERT once", and then read-only after that. That is: after the initial INSERT of a record, there are never any UPDATE or DELETE statements. Because of this, the on-disk fragmentation of these tables is minimal.
I'm now considering adding a needs_action boolean field to every table. This field will only ever be updated once, on a slow/periodic basis. As a result of MVCC, when VACUUM comes along (on an even slower schedule) after the UPDATE, the table becomes quite fragmented: VACUUM clears out the originally inserted tuples, and the freed space is subsequently backfilled by new insertions.
In short: the addition of this single "always updated once" field changes the table from being minimally fragmented by design to being highly fragmented by design.
Is there some method of effectively achieving this single needs_action flag per record that avoids the resulting table fragmentation?
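For concreteness, the pattern in question looks something like this (table and column names are illustrative):

-- Hypothetical "INSERT once" table with the proposed flag.
CREATE TABLE events (
    id           bigserial PRIMARY KEY,
    payload      text NOT NULL,
    needs_action boolean NOT NULL DEFAULT true
);

-- Partial index so pending rows can be found quickly.
CREATE INDEX events_needs_action_idx ON events (id) WHERE needs_action;

-- The one-and-only UPDATE each row ever sees.
UPDATE events SET needs_action = false WHERE id = 42;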
<And now for some background/supplemental info...>
Some options considered so far...
At the risk of making this question massive (and therefore overlooked?), below are some options that have been considered so far:
- Just add the column to each table, do the UPDATE, and don't worry about the resulting fragmentation until it actually proves to be an issue.
- Make an independent tracking table (for every table) containing only A) the PK from the master table and B) the needs_action flag. Create a record in the tracking table with an AFTER INSERT trigger on the master table.
- Force the needs_action update to be a HOT update to avoid tuple replication.
  - Needing an index to quickly find rows WHERE needs_action = TRUE seems to rule out this option (an indexed column can't be HOT-updated), but maybe there's another way to find them quickly?
- Use a table fillfactor (at 50?) to leave space for the inevitable UPDATE, so the updated tuple stays in the same page.
  - It seems like this will leave the table's packing fraction eternally at 50% and take twice the storage space? I don't 100% understand this option yet... still learning.
- Find a special/magical field/bit in the master table records that can be twiddled without MVCC impact.
  - This doesn't seem to exist (and it would still need to be findable, e.g. via a WHERE needs_action = TRUE partial index).
- Store needs_action outside of Postgres (e.g. as a <table_name>:needs_copying list of PKs in Redis) to evade fragmentation due to MVCC.
  - Maybe redis_fdw (or some other FDW?) in an AFTER INSERT trigger can keep it atomic? I need to learn more about FDW capabilities... all the FDWs I can find seem to be read-only, though.
- Run a fancy view with background defrag/compacting, as described in this fantastic article.
- Simply track the ids/PKs that need copying in a Postgres table, and DELETE those records when the action is completed.
  - Kind of like RPUSHing to an offline Redis list (but definitely ACID).

Are there other options to consider?
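Of the options above, the independent tracking table could be sketched roughly like this (names are illustrative, and a master table "events" with a bigint PK "id" is assumed):

-- Narrow tracking table: one row per pending master-table row.
CREATE TABLE events_needs_action (
    event_id bigint PRIMARY KEY REFERENCES events (id)
);

-- Enqueue every new row; runs in the same transaction as the INSERT.
CREATE FUNCTION events_track() RETURNS trigger AS $$
BEGIN
    INSERT INTO events_needs_action (event_id) VALUES (NEW.id);
    RETURN NULL;  -- return value is ignored for AFTER triggers
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_track
    AFTER INSERT ON events
    FOR EACH ROW EXECUTE PROCEDURE events_track();

-- When the action is done, DELETE instead of UPDATE: the master table is
-- never rewritten, and the small tracking table absorbs all the churn.
DELETE FROM events_needs_action WHERE event_id = 42;

With this shape, only the narrow tracking table fragments, and since completed rows are deleted it stays small enough that autovacuum handles it cheaply.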
More on the specific use case driving this...
I'm interested in the general question of how to avoid this fragmentation, but here's a bit more on the current use case: SELECT queries will span wide table ranges (not just recent data) and can return anywhere from a single record to 100k+ rows.

Upvotes: 3
Views: 2108
Reputation: 247225
I would just lower the fillfactor below the default value of 100.
Depending on the size of the row, use a value like 80 or 90, so that a few new rows will still fit into the block. After the update, the old row can be “pruned” and defragmented by the next transaction so that the space can be reused.
A value of 50 seems way too low. True, this would leave room for all rows in a block being updated at the same time, but that is not your use case, right?
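As a sketch (the table name is illustrative), fillfactor can be set at creation time or changed later; note that changing it only affects pages written afterwards:

-- Leave ~10% free space in each heap page for the one expected UPDATE.
CREATE TABLE events (
    id           bigserial PRIMARY KEY,
    payload      text,
    needs_action boolean NOT NULL DEFAULT true
) WITH (fillfactor = 90);

-- For an existing table; existing pages are only repacked by a rewrite
-- such as VACUUM FULL or CLUSTER.
ALTER TABLE events SET (fillfactor = 90);

If needs_action itself is not indexed, the in-page update can also be a HOT update, so no index entries have to be touched.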
Upvotes: 2
Reputation: 3266
You can try to use inherited tables. Inheritance does not directly support a primary key across the hierarchy, but that can be worked around with triggers.
CREATE TABLE data_parent (a int8, updated bool);
CREATE TABLE data_inserted (CHECK (NOT updated)) INHERITS (data_parent);
CREATE TABLE data_updated (CHECK (updated)) INHERITS (data_parent);
CREATE FUNCTION d_insert () RETURNS TRIGGER AS $$
BEGIN
NEW.updated = false;
INSERT INTO data_inserted VALUES (NEW.*);
RETURN NULL;
END
$$ LANGUAGE plpgsql SECURITY DEFINER;
CREATE TRIGGER d_insert BEFORE INSERT ON data_parent FOR EACH ROW EXECUTE PROCEDURE d_insert();
CREATE FUNCTION d_update () RETURNS TRIGGER AS $$
BEGIN
NEW.updated = true;
INSERT INTO data_updated VALUES (NEW.*);
DELETE FROM data_inserted WHERE (data_inserted.*) IN (OLD);
RETURN NULL;
END
$$ LANGUAGE plpgsql SECURITY DEFINER;
CREATE TRIGGER d_update BEFORE UPDATE ON data_inserted FOR EACH ROW EXECUTE PROCEDURE d_update();
-- GRANT INSERT ON data_parent TO the regular user
-- REVOKE INSERT/UPDATE ON data_inserted and data_updated FROM the regular user
INSERT INTO data_parent (a) VALUES (1);
SELECT * FROM ONLY data_parent;
SELECT * FROM ONLY data_inserted;
SELECT * FROM ONLY data_updated;
INSERT 0 0
a | updated
---+---------
(0 rows)
a | updated
---+---------
1 | f
(1 row)
a | updated
---+---------
(0 rows)
UPDATE data_parent SET updated = true;
SELECT * FROM ONLY data_parent;
SELECT * FROM ONLY data_inserted;
SELECT * FROM ONLY data_updated;
UPDATE 0
a | updated
---+---------
(0 rows)
a | updated
---+---------
(0 rows)
a | updated
---+---------
1 | t
(1 row)
Upvotes: 1