Alez

Reputation: 95

Unique index with a lot of columns

I have a busy OLTP table with 30 columns and 50 million rows, and I want to avoid duplicates in it.
What approach should I take?

So far I came up with these:

With the latter, I feel like there would be a lot of hassle regenerating that hash column whenever the table schema changes.

Maybe there are some other approaches that I didn't think of?

Upvotes: 2

Views: 1726

Answers (1)

Erwin Brandstetter

Reputation: 657002

Postgres 14

... just came out with a built-in hash function for records, which is substantially cheaper than my custom function. Especially for many columns! See:

That makes the expression index much more attractive than a generated column plus index. So just:

CREATE UNIQUE INDEX tbl_row_uni ON tbl (hash_record_extended(tbl.*, 0));

This normally works, too:

CREATE UNIQUE INDEX tbl_row_uni ON tbl (hash_record_extended(tbl, 0));

But the first variant is safer: in the second variant, tbl would resolve to a column rather than the whole row if a column of that name existed.
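For illustration, here is the Postgres 14 variant in action on a scaled-down three-column stand-in for the real 30-column table (table and column names are hypothetical):

```sql
-- scaled-down sketch, assuming Postgres 14+
CREATE TABLE tbl (col1 text, col2 text, col3 text);

-- unique expression index over the hash of the whole row
CREATE UNIQUE INDEX tbl_row_uni ON tbl (hash_record_extended(tbl.*, 0));

INSERT INTO tbl VALUES ('a', 'b', 'c');  -- succeeds
INSERT INTO tbl VALUES ('a', 'b', 'c');  -- raises unique_violation
```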

Postgres 13 (original answer)

I recently provided a solution for exactly that problem on dba.SE:

It's pretty close to your third idea:

Basically, a very efficient server-side generated hash added as a 31st column with a UNIQUE constraint.

CREATE OR REPLACE FUNCTION public.f_tbl_bighash(col1 text, col2 text, ... , col30 text)
  RETURNS bigint
  LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT hashtextextended(textin(record_out(($1, $2, ... , $30))), 0)';

ALTER TABLE tbl
  ADD COLUMN tbl_bighash bigint NOT NULL
    GENERATED ALWAYS AS (public.f_tbl_bighash(col1, col2, ... , col30)) STORED  -- appends the column in last position
, ADD CONSTRAINT tbl_bighash_uni UNIQUE (tbl_bighash);
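A scaled-down, self-contained sketch of the same setup with three columns instead of thirty (table and column names are hypothetical):

```sql
CREATE TABLE tbl (col1 text, col2 text, col3 text);

CREATE FUNCTION public.f_tbl_bighash(col1 text, col2 text, col3 text)
  RETURNS bigint
  LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT hashtextextended(textin(record_out(($1, $2, $3))), 0)';

ALTER TABLE tbl
  ADD COLUMN tbl_bighash bigint NOT NULL
    GENERATED ALWAYS AS (public.f_tbl_bighash(col1, col2, col3)) STORED
, ADD CONSTRAINT tbl_bighash_uni UNIQUE (tbl_bighash);

-- inserts need no change; the hash column is computed automatically
INSERT INTO tbl (col1, col2, col3) VALUES ('a', 'b', 'c');  -- succeeds
INSERT INTO tbl (col1, col2, col3) VALUES ('a', 'b', 'c');  -- raises unique_violation
```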

The beauty of it: it works efficiently without changing anything else. (Except, possibly, where you use SELECT * or INSERT INTO without a target list or similar.)

And it works for NULL values, too (treating them as equal).
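This is worth spelling out, because a plain multicolumn UNIQUE constraint treats NULLs as distinct by default, so all-but-NULL duplicates would slip through. A hypothetical two-column contrast:

```sql
-- plain UNIQUE constraint: NULLs are considered distinct
CREATE TABLE t1 (a text, b text, UNIQUE (a, b));
INSERT INTO t1 VALUES ('x', NULL);
INSERT INTO t1 VALUES ('x', NULL);  -- succeeds: duplicate slips through

-- hash-based unique index: NULLs hash deterministically, duplicates collide
CREATE TABLE t2 (a text, b text);
CREATE UNIQUE INDEX t2_row_uni ON t2 (hash_record_extended(t2.*, 0));
INSERT INTO t2 VALUES ('x', NULL);
INSERT INTO t2 VALUES ('x', NULL);  -- raises unique_violation
```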

Be careful if any column types have a non-immutable text representation (like timestamptz). The solution is tested with all-text columns.

If the table schema changes, drop the UNIQUE constraint first, recreate the function and recreate the generated column - ideally with a single ALTER TABLE statement, so you don't rewrite the table twice.
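A sketch of that migration, scaled down to a table that grows from three columns to four (names hypothetical); the order matters because the generated column depends on the function:

```sql
ALTER TABLE tbl
  DROP CONSTRAINT tbl_bighash_uni
, DROP COLUMN tbl_bighash;   -- dropping a column does not rewrite the table

-- the signature changes, so CREATE OR REPLACE won't do
DROP FUNCTION public.f_tbl_bighash(text, text, text);
CREATE FUNCTION public.f_tbl_bighash(col1 text, col2 text, col3 text, col4 text)
  RETURNS bigint
  LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT hashtextextended(textin(record_out(($1, $2, $3, $4))), 0)';

-- re-add column and constraint in a single ALTER TABLE: one table rewrite
ALTER TABLE tbl
  ADD COLUMN tbl_bighash bigint NOT NULL
    GENERATED ALWAYS AS (public.f_tbl_bighash(col1, col2, col3, col4)) STORED
, ADD CONSTRAINT tbl_bighash_uni UNIQUE (tbl_bighash);
```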

Alternatively, use a UNIQUE expression index based on public.f_tbl_bighash(). Same effect. Upside: no additional table column. Downside: a bit more expensive, computationally.
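The expression-index variant, again on the hypothetical three-column stand-in:

```sql
-- no extra table column; the hash lives only in the index
CREATE UNIQUE INDEX tbl_row_hash_uni ON tbl (public.f_tbl_bighash(col1, col2, col3));
```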

Upvotes: 3
