Reputation: 95
I have a busy OLTP table with 30 columns and 50 million rows, and I want to avoid duplicates in it.
What approach should I take?
So far I came up with these:
With the latter I feel like there will be a lot of hassle regenerating that hash column if the table schema changes.
Maybe there are some other approaches that I didn't think of?
Upvotes: 2
Views: 1726
Reputation: 657002
... just came out with a built-in hash function for records, which is substantially cheaper than my custom function. Especially for many columns! See:
That makes the expression index much more attractive than a generated column plus index. So just:
CREATE UNIQUE INDEX tbl_row_uni ON tbl (hash_record_extended(tbl.*,0));
This normally works, too:
CREATE UNIQUE INDEX tbl_row_uni ON tbl (hash_record_extended(tbl,0));
But the first variant is safer: in the second variant, tbl would resolve to the column rather than the whole row if a column of that name should exist.
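To illustrate the effect, here is a minimal, self-contained sketch on a small hypothetical table (the names item and item_row_uni are made up for the example), assuming Postgres 14 or later where hash_record_extended() is available:

CREATE TABLE item (a int, b text);

CREATE UNIQUE INDEX item_row_uni ON item (hash_record_extended(item.*, 0));

INSERT INTO item VALUES (1, 'foo');  -- succeeds
INSERT INTO item VALUES (1, 'foo');  -- fails: duplicate key value violates unique constraint "item_row_uni"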
I recently provided a solution for exactly that problem on dba.SE:
It's pretty close to your third idea:
Basically, a very efficient server-side generated hash placed as 31st column with a UNIQUE constraint.
CREATE OR REPLACE FUNCTION public.f_tbl_bighash(col1 text, col2 text, ... , col30 text)
RETURNS bigint
LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT hashtextextended(textin(record_out(($1,$2, ... ,$30))), 0)';
ALTER TABLE tbl
ADD COLUMN tbl_bighash bigint NOT NULL GENERATED ALWAYS AS (public.f_tbl_bighash(col1, col2, ... , col30)) STORED -- append column in last position
, ADD CONSTRAINT tbl_bighash_uni UNIQUE (tbl_bighash);
The beauty of it: it works efficiently without changing anything else. (Except, possibly, where you use SELECT * or INSERT INTO without a target list, or similar.)
And it works for NULL values, too (treating them as equal).
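To make the NULL handling concrete, here is a hedged, self-contained sketch with a hypothetical two-column table (demo, f_demo_bighash and all names are made up for the example; the real function takes all 30 columns):

CREATE FUNCTION public.f_demo_bighash(col1 text, col2 text)
  RETURNS bigint
  LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT hashtextextended(textin(record_out(($1,$2))), 0)';

CREATE TABLE demo (
  col1 text
, col2 text
, demo_bighash bigint NOT NULL GENERATED ALWAYS AS (public.f_demo_bighash(col1, col2)) STORED
, CONSTRAINT demo_bighash_uni UNIQUE (demo_bighash)
);

INSERT INTO demo (col1, col2) VALUES ('a', NULL);  -- succeeds
INSERT INTO demo (col1, col2) VALUES ('a', NULL);  -- fails: both rows produce the same hash, so NULLs are treated as equal here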
Careful if any column types have a non-immutable text representation (like timestamptz). The solution is tested with all text columns.
If the table schema changes, drop the UNIQUE constraint first, recreate the function, and recreate the generated column - ideally with a single ALTER TABLE statement, so you don't rewrite the table twice.
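A hedged sketch of that procedure, continuing the hypothetical demo table from above after a new column col3 has been added (combining the DROP and ADD of the same column in one statement is an assumption here; test before running it on a busy production table):

ALTER TABLE demo ADD COLUMN col3 text;  -- the schema change itself

-- new function for the new column list (different argument count, so it does not conflict with the old one)
CREATE FUNCTION public.f_demo_bighash(col1 text, col2 text, col3 text)
  RETURNS bigint
  LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT hashtextextended(textin(record_out(($1,$2,$3))), 0)';

-- single ALTER TABLE: drop constraint and old hash column, re-add both - one table rewrite instead of two
ALTER TABLE demo
  DROP CONSTRAINT demo_bighash_uni
, DROP COLUMN demo_bighash
, ADD COLUMN demo_bighash bigint NOT NULL GENERATED ALWAYS AS (public.f_demo_bighash(col1, col2, col3)) STORED
, ADD CONSTRAINT demo_bighash_uni UNIQUE (demo_bighash);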
Alternatively, use a UNIQUE expression index based on public.f_tbl_bighash(). Same effect. Upside: no additional table column. Downside: a bit more expensive, computationally.
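For reference, that alternative would be a sketch along these lines (index name is illustrative, and the same column shorthand as in the function definition above is used):

CREATE UNIQUE INDEX tbl_bighash_idx ON tbl (public.f_tbl_bighash(col1, col2, ... , col30));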
Upvotes: 3