Reputation: 7379
When storing a tsvector
value in a column, for records with no search terms, should I store an empty tsvector
or a NULL
value?
Does it matter?
Is there any difference in terms of performance, or storage overhead from storing empty vectors?
In other words, when updating the vector based on a value of, say, a nullable title
column, do I need to always compute this as to_tsvector(coalesce(title,''))
(since to_tsvector
returns NULL
when given a NULL
argument) or is it enough to do to_tsvector(title)
?
Upvotes: 16
Views: 2408
Reputation: 221135
First off, the semantics of SQL NULL
is that of UNKNOWN
, whereas some data types also have an "empty" value. Those data types include:
TEXT
(''
isn't the same as NULL::TEXT
)JSON
and JSONB
([]
or {}
aren't the same as NULL::JSON
or NULL:JSONB
)X[]
(ARRAY[]::X[]
isn't the same as NULL::X[]
)There are many more, including TSVECTOR
. The semantics of an empty collection of something is always subtly different from that of a NULL
value, which is an UNKNOWN
collection (often just used as an absent collection, though). The distiction manifests specifically when it comes to using operators, e.g.
'' || 'abc' = 'abc'
but NULL || 'abc' IS NULL
to_tsvector('cats ate rats') @@ to_tsquery('cat & rat') = true
but NULL @@ to_tsquery('cat & rat') IS NULL
In that sense, the decision should be foremost a logic one, not a storage one, based on this question: Will you still work with the TSVECTOR
value of a record, even if the record doesn't have any search terms (pro empty TSVECTOR
)? Or does the feature not apply at all to that particular record (pro NULL
value)? For the @@
operator, it may not be so relevant, but it definitely is for the ||
operator, and others.
The answer isn't obvious, nor is there a clear right / wrong way in general.
If this is a highly performance sensitive situation in your application (e.g. you have a lot of empty TSVECTOR
values), then maybe, this benchmark could help you with the decision?
I ran the below benchmark on PostgreSQL 14.1 in Docker to get this result:
RUN 1, Statement 1: 2.91145
RUN 1, Statement 2: 1.00000 -- The fastest run is 1. The others are multiples of 1
RUN 2, Statement 1: 2.80509
RUN 2, Statement 2: 1.05232
RUN 3, Statement 1: 2.78001
RUN 3, Statement 2: 1.00202
RUN 4, Statement 1: 2.74319
RUN 4, Statement 2: 1.00524
RUN 5, Statement 1: 2.75808
RUN 5, Statement 2: 1.00045
SELECT v @@ to_tsquery('cat & rat')
with v tsvector = to_tsvector('');
SELECT NULL @@ to_tsquery('cat & rat')
The fact that NULL
is involved probably leads to a shortcut in the @@
operator's algorithm, which produces a 2.7x performance improvement over querying an empty TSVECTOR
in the benchmark. So, there do seem to be benefits of using NULL
in terms of performance.
Obviously, that's just a benchmark, which doesn't necessarily reflect real-world use-cases, but it should give you a hint of a potential difference.
Benchmark code
For reproduction or adaptations, here's a benchmark, based on this technique.
DO $$
DECLARE
v_ts TIMESTAMP;
v_repeat CONSTANT INT := 10000;
rec RECORD;
run INT[];
stmt INT[];
elapsed DECIMAL[];
min_elapsed DECIMAL;
i INT := 1;
-- Store the vector in a local variable to avoid re-computing it in the benchmark
v tsvector = to_tsvector('');
BEGIN
-- Repeat the whole benchmark several times to avoid warmup penalty
FOR r IN 1..5 LOOP
v_ts := clock_timestamp();
FOR i IN 1..v_repeat LOOP
FOR rec IN (
-- Statement 1
SELECT v @@ to_tsquery('cat & rat')
) LOOP
NULL;
END LOOP;
END LOOP;
run[i] := r;
stmt[i] := 1;
elapsed[i] := (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP))
- EXTRACT(EPOCH FROM v_ts));
i := i + 1;
v_ts := clock_timestamp();
FOR i IN 1..v_repeat LOOP
FOR rec IN (
-- Statement 2
SELECT NULL @@ to_tsquery('cat & rat')
) LOOP
NULL;
END LOOP;
END LOOP;
run[i] := r;
stmt[i] := 2;
elapsed[i] := (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP))
- EXTRACT(EPOCH FROM v_ts));
i := i + 1;
END LOOP;
SELECT min(t.elapsed)
INTO min_elapsed
FROM unnest(elapsed) AS t(elapsed);
FOR i IN 1..array_length(run, 1) LOOP
RAISE INFO 'RUN %, Statement %: %', run[i], stmt[i],
CAST(elapsed[i] / min_elapsed AS DECIMAL(10, 5));
END LOOP;
END$$;
Upvotes: 3