Reputation: 95
I have a busy OLTP table with 30 columns and 50 million rows, and I want to avoid duplicates in it.
What approach should I take?
So far I came up with these:
With the latter I feel like there will be a lot of hassle regenerating that hash column if the table schema changes.
Maybe there are some other approaches that I didn't think of?
Upvotes: 2
Views: 1726
Reputation: 657002
... just came out with a built-in hash function for records, which is substantially cheaper than my custom function. Especially for many columns! See:
That makes the expression index much more attractive than a generated column plus index. So just:
CREATE UNIQUE INDEX tbl_row_uni ON tbl (hash_record_extended(tbl.*,0));
This normally works, too:
CREATE UNIQUE INDEX tbl_row_uni ON tbl (hash_record_extended(tbl,0));
But the first variant is safer: in the second variant, tbl would resolve to the column rather than the whole row if a column of that name should exist.
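To illustrate the effect, here is a minimal, self-contained sketch on a small hypothetical table (the names item and item_row_uni are made up for the example), assuming Postgres 14 or later where hash_record_extended() is available:

CREATE TABLE item (a int, b text);

CREATE UNIQUE INDEX item_row_uni ON item (hash_record_extended(item.*, 0));

INSERT INTO item VALUES (1, 'foo');  -- succeeds
INSERT INTO item VALUES (1, 'foo');  -- fails: duplicate key value violates unique constraint "item_row_uni"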
I recently provided a solution for exactly that problem on dba.SE:
It's pretty close to your third idea:
Basically, a very efficient server-side generated hash placed as 31st column with a UNIQUE constraint.
CREATE OR REPLACE FUNCTION public.f_tbl_bighash(col1 text, col2 text, ... , col30 text)
RETURNS bigint
LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT hashtextextended(textin(record_out(($1,$2, ... ,$30))), 0)';
ALTER TABLE tbl
ADD COLUMN tbl_bighash bigint NOT NULL GENERATED ALWAYS AS (public.f_tbl_bighash(col1, col2, ... , col30)) STORED -- append column in last position
, ADD CONSTRAINT tbl_bighash_uni UNIQUE (tbl_bighash);
The beauty of it: it works efficiently without changing anything else. (Except, possibly, where you use SELECT * or INSERT INTO without a target list, or similar.)
And it works for NULL values, too (treating them as equal).
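To make the NULL handling concrete, here is a hedged, self-contained sketch with a hypothetical two-column table (demo, f_demo_bighash and all names are made up for the example; the real function takes all 30 columns):

CREATE FUNCTION public.f_demo_bighash(col1 text, col2 text)
  RETURNS bigint
  LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT hashtextextended(textin(record_out(($1,$2))), 0)';

CREATE TABLE demo (
  col1 text
, col2 text
, demo_bighash bigint NOT NULL GENERATED ALWAYS AS (public.f_demo_bighash(col1, col2)) STORED
, CONSTRAINT demo_bighash_uni UNIQUE (demo_bighash)
);

INSERT INTO demo (col1, col2) VALUES ('a', NULL);  -- succeeds
INSERT INTO demo (col1, col2) VALUES ('a', NULL);  -- fails: both rows produce the same hash, so NULLs are treated as equal here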
Careful if any column types have a non-immutable text representation (like timestamptz). The solution is tested with all text columns.
If the table schema changes, drop the UNIQUE constraint first, recreate the function, and recreate the generated column - ideally with a single ALTER TABLE statement, so you don't rewrite the table twice.
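A hedged sketch of that procedure, continuing the hypothetical demo table from above after a new column col3 has been added (combining the DROP and ADD of the same column in one statement is an assumption here; test before running it on a busy production table):

ALTER TABLE demo ADD COLUMN col3 text;  -- the schema change itself

-- new function for the new column list (different argument count, so it does not conflict with the old one)
CREATE FUNCTION public.f_demo_bighash(col1 text, col2 text, col3 text)
  RETURNS bigint
  LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT hashtextextended(textin(record_out(($1,$2,$3))), 0)';

-- single ALTER TABLE: drop constraint and old hash column, re-add both - one table rewrite instead of two
ALTER TABLE demo
  DROP CONSTRAINT demo_bighash_uni
, DROP COLUMN demo_bighash
, ADD COLUMN demo_bighash bigint NOT NULL GENERATED ALWAYS AS (public.f_demo_bighash(col1, col2, col3)) STORED
, ADD CONSTRAINT demo_bighash_uni UNIQUE (demo_bighash);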
Alternatively, use a UNIQUE expression index based on public.f_tbl_bighash(). Same effect. Upside: no additional table column. Downside: a bit more expensive, computationally.
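For reference, that alternative would be a sketch along these lines (index name is illustrative, and the same column shorthand as in the function definition above is used):

CREATE UNIQUE INDEX tbl_bighash_idx ON tbl (public.f_tbl_bighash(col1, col2, ... , col30));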
Upvotes: 3