david_adler

Reputation: 10912

When to use multi column indexes?

I have the following table with about 10 million rows

CREATE TABLE "autocomplete_books"
(
    id uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
    author_id uuid NOT NULL REFERENCES "author"(id) ON DELETE CASCADE,
    language VARCHAR(30) NOT NULL,
    name VARCHAR(100) NOT NULL,
    importance_rank SMALLINT NOT NULL DEFAULT 1
);
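
For reference, this relies on the uuid-ossp extension for uuid_generate_v4() and on pg_trgm for the % operator, similarity() and gin_trgm_ops used below:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp"; -- uuid_generate_v4()
CREATE EXTENSION IF NOT EXISTS pg_trgm;     -- %, similarity(), gin_trgm_ops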

I have the following query

SELECT DISTINCT ON (author_id)
    author_id,
    similarity(name, $1) as score,
    language, name, importance_rank
FROM
    "autocomplete_books"
WHERE
    $1 % name AND language IN ($2, $3, $4)
ORDER BY
    author_id, score DESC, importance_rank DESC
LIMIT
    10
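
As I understand it, $1 % name is true exactly when similarity(name, $1) exceeds pg_trgm.similarity_threshold (0.3 by default), so the WHERE clause and the score column are driven by the same trigram measure. A quick check of that knob (the values here are illustrative):

SHOW pg_trgm.similarity_threshold;        -- 0.3 by default
SELECT similarity('sale', 'sales');       -- well above 0.3, so 'sale' % 'sales' is true
SET pg_trgm.similarity_threshold = 0.45;  -- raising it shrinks the candidate set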

I am querying primarily on similarity, since this is an autocomplete endpoint, so I have a trigram index on name. I am also sorting on some other fields. I am not sure how the computed score column will interact with my other indexes, and whether it is better to have a compound index like so:

Option 1

CREATE INDEX ON "autocomplete_books" USING GIN (name gin_trgm_ops);
CREATE INDEX ON "autocomplete_books" USING BTREE (author_id, language, importance_rank DESC);

Or if I should break them out like so

Option 2

CREATE INDEX ON "autocomplete_books" USING GIN (name gin_trgm_ops);
CREATE INDEX ON "autocomplete_books" USING BTREE (author_id, language, importance_rank DESC);
CREATE INDEX ON "autocomplete_books" USING BTREE (language);
CREATE INDEX ON "autocomplete_books" USING BTREE (importance_rank DESC);

Here is the output of EXPLAIN ANALYZE run on 220k rows with the following indexes:

CREATE INDEX ON "autocomplete_books" USING BTREE (author_id, language);
CREATE INDEX ON "autocomplete_books" USING BTREE (importance_rank DESC);


Limit  (cost=762.13..762.38 rows=50 width=82) (actual time=12.230..13.024 rows=50 loops=1)
  ->  Unique  (cost=762.13..763.23 rows=217 width=82) (actual time=12.223..12.686 rows=50 loops=1)
        ->  Sort  (cost=762.13..762.68 rows=220 width=82) (actual time=12.216..12.373 rows=50 loops=1)
              Sort Key: author_id, (similarity((name)::text, 'sale'::text)) DESC, importance_rank DESC
              Sort Method: quicksort  Memory: 45kB
              ->  Bitmap Heap Scan on "books_autocomplete" mat  (cost=45.71..753.57 rows=220 width=82) (actual time=1.905..11.610 rows=149 loops=1)
                    Recheck Cond: ('sale'::text % (name)::text)
                    Rows Removed by Index Recheck: 2837
                    Filter: ((language)::text = ANY ('{language1,language2,language3}'::text[]))
                    Heap Blocks: exact=2078
                    ->  Bitmap Index Scan on "books_autocomplete_name_idx"  (cost=0.00..45.65 rows=220 width=0) (actual time=1.551..1.557 rows=2986 loops=1)
                          Index Cond: ('sale'::text % (name)::text)
Planning time: 13.976 ms
Execution time: 13.545 ms

Upvotes: 4

Views: 429

Answers (1)

Laurenz Albe

Reputation: 246393

An index will only help you with sorting if all expressions in the ORDER BY clause are in the index, and you can't achieve that here, because the second expression (the computed similarity score) depends on the query parameter.

Also, only B-tree indexes are useful for supporting ORDER BY. Moreover, you cannot combine multiple indexes when you want to use ORDER BY, and you say that $1 % name is your most selective criterion, so you probably want to use an index on that.

There are two paths this query can take (a way to check which one the planner picks is sketched after the list):

  1. Go for the $1 % name condition with a trigram GIN index on name. This is what the execution plan in your question does. Then you'll have to live with that Sort, because you can't use an index for it. The danger here is that the bitmap index scan will find so many rows that the bitmap heap scan is quite expensive.

  2. If there is an index that is an exact match for the ORDER BY clause:

    CREATE INDEX ON autocomplete_books
       (author_id, score DESC, importance_rank DESC);
    

    you can scan the index and fetch rows in ORDER BY order until you have 10 that match the filter condition $1 % name. The danger here is that it may take longer than expected to find the 10 rows.
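
To see which of the two paths the planner actually takes for a given search term, you can run the query through EXPLAIN with the parameters filled in (a sketch; 'sale' and the language values stand in for the real parameters):

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (author_id)
    author_id,
    similarity(name, 'sale') AS score,
    language, name, importance_rank
FROM autocomplete_books
WHERE 'sale' % name AND language IN ('language1', 'language2', 'language3')
ORDER BY author_id, score DESC, importance_rank DESC
LIMIT 10;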

Try first with only the one index, then with only the other, and run the query with different parameters on a data set of realistic size to see what works best.
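
Since plain CREATE INDEX is transactional in PostgreSQL, one way to A/B test without leaving stray indexes behind is to build each candidate inside a transaction and roll it back afterwards (a sketch; the index name is made up):

BEGIN;
CREATE INDEX trgm_only_idx ON autocomplete_books USING GIN (name gin_trgm_ops);
-- run the EXPLAIN (ANALYZE, BUFFERS) query from above here, with several search terms
ROLLBACK;  -- discards the index again

Then repeat with the other candidate index in its place. Note that CREATE INDEX without CONCURRENTLY blocks writes to the table for the duration, so do this on a copy of the data or during quiet hours.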

You should drop all indexes other than these two, because they won't do this query any good.

If one of the two strategies is a clear winner, drop the other index so the optimizer is not tempted to use it. Otherwise keep both and hope that the optimizer picks the right one depending on the parameters.

Upvotes: 3
