mindplay.dk
mindplay.dk

Reputation: 7379

Should I store empty tsvector values or NULL values?

When storing a tsvector value in a column, for records with no search terms, should I store an empty tsvector or a NULL value?

Does it matter?

Is there any difference in terms of performance, or storage overhead from storing empty vectors?

In other words, when updating the vector based on a value of, say, a nullable title column, do I need to always compute this as to_tsvector(coalesce(title,'')) (since to_tsvector returns NULL when given a NULL argument) or is it enough to do to_tsvector(title)?

Upvotes: 16

Views: 2408

Answers (1)

Lukas Eder
Lukas Eder

Reputation: 221135

Logic aspects of your question

First off, the semantics of SQL NULL is that of UNKNOWN, whereas some data types also have an "empty" value. Those data types include:

  • TEXT ('' isn't the same as NULL::TEXT)
  • JSON and JSONB ([] or {} aren't the same as NULL::JSON or NULL:JSONB)
  • X[] (ARRAY[]::X[] isn't the same as NULL::X[])

There are many more, including TSVECTOR. The semantics of an empty collection of something is always subtly different from that of a NULL value, which is an UNKNOWN collection (often just used as an absent collection, though). The distiction manifests specifically when it comes to using operators, e.g.

  • '' || 'abc' = 'abc' but NULL || 'abc' IS NULL
  • to_tsvector('cats ate rats') @@ to_tsquery('cat & rat') = true but NULL @@ to_tsquery('cat & rat') IS NULL

In that sense, the decision should be foremost a logic one, not a storage one, based on this question: Will you still work with the TSVECTOR value of a record, even if the record doesn't have any search terms (pro empty TSVECTOR)? Or does the feature not apply at all to that particular record (pro NULL value)? For the @@ operator, it may not be so relevant, but it definitely is for the || operator, and others.

The answer isn't obvious, nor is there a clear right / wrong way in general.

Performance aspects of your question

If this is a highly performance sensitive situation in your application (e.g. you have a lot of empty TSVECTOR values), then maybe, this benchmark could help you with the decision?

I ran the below benchmark on PostgreSQL 14.1 in Docker to get this result:

RUN 1, Statement 1: 2.91145
RUN 1, Statement 2: 1.00000 -- The fastest run is 1. The others are multiples of 1
RUN 2, Statement 1: 2.80509
RUN 2, Statement 2: 1.05232
RUN 3, Statement 1: 2.78001
RUN 3, Statement 2: 1.00202
RUN 4, Statement 1: 2.74319
RUN 4, Statement 2: 1.00524
RUN 5, Statement 1: 2.75808
RUN 5, Statement 2: 1.00045
  • Statement 1 is SELECT v @@ to_tsquery('cat & rat') with v tsvector = to_tsvector('');
  • Statement 2 is SELECT NULL @@ to_tsquery('cat & rat')

The fact that NULL is involved probably leads to a shortcut in the @@ operator's algorithm, which produces a 2.7x performance improvement over querying an empty TSVECTOR in the benchmark. So, there do seem to be benefits of using NULL in terms of performance.

Obviously, that's just a benchmark, which doesn't necessarily reflect real-world use-cases, but it should give you a hint of a potential difference.

Benchmark code

For reproduction or adaptations, here's a benchmark, based on this technique.

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 10000;
  rec RECORD;
  run INT[];
  stmt INT[];
  elapsed DECIMAL[];
  min_elapsed DECIMAL;
  i INT := 1;

  -- Store the vector in a local variable to avoid re-computing it in the benchmark
  v tsvector = to_tsvector('');
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        -- Statement 1
        SELECT v @@ to_tsquery('cat & rat')
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    run[i] := r;
    stmt[i] := 1;
    elapsed[i] := (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP)) 
                 - EXTRACT(EPOCH FROM v_ts));
    i := i + 1;
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        -- Statement 2
        SELECT NULL @@ to_tsquery('cat & rat') 
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    run[i] := r;
    stmt[i] := 2;
    elapsed[i] := (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP))
                 - EXTRACT(EPOCH FROM v_ts));
    i := i + 1;
  END LOOP;

  SELECT min(t.elapsed)
  INTO min_elapsed
  FROM unnest(elapsed) AS t(elapsed);

  FOR i IN 1..array_length(run, 1) LOOP
    RAISE INFO 'RUN %, Statement %: %', run[i], stmt[i], 
      CAST(elapsed[i] / min_elapsed AS DECIMAL(10, 5));
  END LOOP;
END$$;

Upvotes: 3

Related Questions