SlideM

Reputation: 461

Index for searching nested JSONB array elements in PostgreSQL

I'm fairly new to Postgres and currently using 9.6. While trying to implement a full text search over its JSONB documents I noticed slow search results for nested arrays. EXPLAIN showed that no indices were used. For simplicity I created a table to investigate:

CREATE TABLE book (
  id   BIGSERIAL NOT NULL,
  data JSONB     NOT NULL
);

My available indices:

CREATE INDEX book_author_idx
  ON book USING GIN (to_tsvector('english', book.data ->> 'author'));
CREATE INDEX book_author_name_idx
  ON book USING GIN (to_tsvector('english', book.data -> 'author' ->> 'name'));

And some data to fill a document:

INSERT INTO book (data)
VALUES (CAST('{"author": [{"id": 0, "name": "Cats"}, ' ||
             '           {"id": 1, "name": "Dogs"}]}' AS JSONB));

I'm able to search for book elements using the following query, but it doesn't use any index. With my actual data of 120k products it takes around 1200ms, while other searches with an index take 0.2ms.

EXPLAIN ANALYZE
SELECT
  id,
  data ->> 'author' AS author
FROM book, jsonb_array_elements(data #> '{author}') author_array
WHERE to_tsvector('english', author_array ->> 'name') @@ to_tsquery('cat');

In contrast, the next query uses book_author_name_idx, but because of the array structure it doesn't find anything.

EXPLAIN ANALYZE
SELECT
  id,
  data ->> 'author' AS author
FROM book
WHERE to_tsvector('english', data -> 'author' ->> 'name') @@ to_tsquery('cat');
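A quick check shows why: data -> 'author' is an array, so the object accessor ->> 'name' has nothing to match:

SELECT '{"author": [{"id": 0, "name": "Cats"}]}'::jsonb -> 'author' ->> 'name';
-- returns NULL, because ->> 'name' is applied to an array, not an object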

How can I tweak my query to use the language-specific index? I'm aware that I could make a separate table for authors and only reference their ids, but I'd rather keep all the data in one table for performance.

Upvotes: 2

Views: 1463

Answers (1)

SlideM

Reputation: 461

With the tips from posz's comments I found a solution. Because the || operator doesn't work the way I need it to (it is strict, so a NULL input yields a NULL result), I used a custom concat function for tsvectors. I took the code from glittershark on GitHub and changed to_tsvector from 'default' to 'english' to fit my needs.

CREATE OR REPLACE FUNCTION concat_tsvectors(tsv1 TSVECTOR, tsv2 TSVECTOR)
  RETURNS TSVECTOR AS $$
BEGIN
  -- NULL-safe concatenation: treat a NULL input as an empty tsvector
  RETURN coalesce(tsv1, to_tsvector('english', ''))
         || coalesce(tsv2, to_tsvector('english', ''));
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE tsvector_agg (
  BASETYPE = TSVECTOR,
  SFUNC = concat_tsvectors,
  STYPE = TSVECTOR,
  INITCOND = ''
);
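The aggregate isn't used by the index below, but it lets you fold per-author tsvectors together directly in SQL. A minimal sketch against the book table from the question (author_elem is just a local alias):

SELECT b.id,
       tsvector_agg(to_tsvector('english', author_elem ->> 'name')) AS author_names
FROM book b,
     jsonb_array_elements(b.data -> 'author') AS author_elem
GROUP BY b.id;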

Here is the custom function I wrote. Input is the document data as JSONB; output is a tsvector with the aggregated author names.

CREATE OR REPLACE FUNCTION author_function(
  IN  data        JSONB,
  OUT resultNames TSVECTOR
)
  RETURNS TSVECTOR AS $$
DECLARE
  authorRecords   RECORD;
  combinedAuthors JSONB [];
  singleAuthor    JSONB;
BEGIN
  -- Collect every author object from the nested array
  FOR authorRecords IN (SELECT value
                        FROM jsonb_array_elements(data #> '{author}'))
  LOOP
    combinedAuthors := combinedAuthors || authorRecords.value;
  END LOOP;
  -- Fold each author's name into one tsvector; the coalesce guards
  -- against documents without an author array
  FOREACH singleAuthor IN ARRAY coalesce(combinedAuthors, '{}')
  LOOP
    resultNames := concat_tsvectors(resultNames, to_tsvector('english', singleAuthor ->> 'name'));
  END LOOP;
END; $$
LANGUAGE plpgsql
IMMUTABLE;
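A quick sanity check against the sample row from the question should return both names stemmed into one vector (exact lexeme positions may vary):

SELECT author_function(data) FROM book;
-- expected along the lines of: 'cat':1 'dog':2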

Then I set up an index on my book objects. Note that the expression index is only accepted because author_function is declared IMMUTABLE.

CREATE INDEX book_author_function_idx
  ON book USING GIN (author_function(book.data));

The author names have already been through to_tsvector('english', ...) inside author_function, so I can query for them like this:

EXPLAIN ANALYZE
SELECT
  id,
  data ->> 'author' AS author
FROM book
WHERE author_function(book.data) @@ to_tsquery('cat');
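One caveat worth noting: to_tsquery('cat') falls back to the session's default_text_search_config, which may not be english. Pinning the configuration keeps the query aligned with the indexed tsvectors:

EXPLAIN ANALYZE
SELECT
  id,
  data ->> 'author' AS author
FROM book
WHERE author_function(book.data) @@ to_tsquery('english', 'cat');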

As a result, queries on my actual data went from 1100-1200ms down to ~0.5ms. I'm not sure if this is the best solution, so if you have better suggestions, please let me know.

Upvotes: 1
