Should I store empty tsvector values or NULL values?

Question

When storing a tsvector value in a column, for records with no search terms, should I store an empty tsvector or a NULL value?

Does it matter?

Is there any difference in terms of performance, or storage overhead from storing empty vectors?

In other words, when updating the vector based on a value of, say, a nullable title column, do I need to always compute this as to_tsvector(coalesce(title,'')) (since to_tsvector returns NULL when given a NULL argument) or is it enough to do to_tsvector(title)?

Lukas Eder · Accepted Answer

Logic aspects of your question

First off, the semantics of SQL NULL is that of UNKNOWN, whereas some data types also have an "empty" value. Those data types include:

TEXT ('' isn't the same as NULL::TEXT)
JSON and JSONB ([] or {} aren't the same as NULL::JSON or NULL:JSONB)
X[] (ARRAY[]::X[] isn't the same as NULL::X[])

There are many more, including TSVECTOR. The semantics of an empty collection of something is always subtly different from that of a NULL value, which is an UNKNOWN collection (often just used as an absent collection, though). The distiction manifests specifically when it comes to using operators, e.g.

'' || 'abc' = 'abc' but NULL || 'abc' IS NULL
to_tsvector('cats ate rats') @@ to_tsquery('cat & rat') = true but NULL @@ to_tsquery('cat & rat') IS NULL

In that sense, the decision should be foremost a logic one, not a storage one, based on this question: Will you still work with the TSVECTOR value of a record, even if the record doesn't have any search terms (pro empty TSVECTOR)? Or does the feature not apply at all to that particular record (pro NULL value)? For the @@ operator, it may not be so relevant, but it definitely is for the || operator, and others.

The answer isn't obvious, nor is there a clear right / wrong way in general.

Performance aspects of your question

If this is a highly performance sensitive situation in your application (e.g. you have a lot of empty TSVECTOR values), then maybe, this benchmark could help you with the decision?

I ran the below benchmark on PostgreSQL 14.1 in Docker to get this result:

RUN 1, Statement 1: 2.91145
RUN 1, Statement 2: 1.00000 -- The fastest run is 1. The others are multiples of 1
RUN 2, Statement 1: 2.80509
RUN 2, Statement 2: 1.05232
RUN 3, Statement 1: 2.78001
RUN 3, Statement 2: 1.00202
RUN 4, Statement 1: 2.74319
RUN 4, Statement 2: 1.00524
RUN 5, Statement 1: 2.75808
RUN 5, Statement 2: 1.00045

Statement 1 is SELECT v @@ to_tsquery('cat & rat') with v tsvector = to_tsvector('');
Statement 2 is SELECT NULL @@ to_tsquery('cat & rat')

The fact that NULL is involved probably leads to a shortcut in the @@ operator's algorithm, which produces a 2.7x performance improvement over querying an empty TSVECTOR in the benchmark. So, there do seem to be benefits of using NULL in terms of performance.

Obviously, that's just a benchmark, which doesn't necessarily reflect real-world use-cases, but it should give you a hint of a potential difference.

Benchmark code

For reproduction or adaptations, here's a benchmark, based on this technique.

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 10000;
  rec RECORD;
  run INT[];
  stmt INT[];
  elapsed DECIMAL[];
  min_elapsed DECIMAL;
  i INT := 1;

  -- Store the vector in a local variable to avoid re-computing it in the benchmark
  v tsvector = to_tsvector('');
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        -- Statement 1
        SELECT v @@ to_tsquery('cat & rat')
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    run[i] := r;
    stmt[i] := 1;
    elapsed[i] := (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP)) 
                 - EXTRACT(EPOCH FROM v_ts));
    i := i + 1;
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        -- Statement 2
        SELECT NULL @@ to_tsquery('cat & rat') 
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    run[i] := r;
    stmt[i] := 2;
    elapsed[i] := (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP))
                 - EXTRACT(EPOCH FROM v_ts));
    i := i + 1;
  END LOOP;

  SELECT min(t.elapsed)
  INTO min_elapsed
  FROM unnest(elapsed) AS t(elapsed);

  FOR i IN 1..array_length(run, 1) LOOP
    RAISE INFO 'RUN %, Statement %: %', run[i], stmt[i], 
      CAST(elapsed[i] / min_elapsed AS DECIMAL(10, 5));
  END LOOP;
END$$;

Should I store empty tsvector values or NULL values?

Answers (1)

Logic aspects of your question

Performance aspects of your question

Related Questions