betamax

Reputation: 14061

PostgreSQL full text search performance not acceptable when ordering by ts_rank_cd

In my PostgreSQL 9.3 database, I have a table called articles. It looks kind of like this:

+------------+--------------------------------------------------------------+
|    Name    |                        Information                           |
+------------+--------------------------------------------------------------+
| id         | Auto increment integer ID                                    |
| title      | text                                                         |
| category   | character varying(255) with index                            |
| keywords   | String with title and extra words used for indexing          |
| tsv        | Trigger updates w/ tsvector_update_trigger based on keywords |
+------------+--------------------------------------------------------------+
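For context, the schema described above might look something like this (a hypothetical reconstruction; the index and trigger names are assumptions, only the column layout comes from the table above):

```sql
-- Hypothetical reconstruction of the relevant schema
CREATE TABLE articles (
    id       serial PRIMARY KEY,
    title    text,
    category character varying(255),
    keywords text,
    tsv      tsvector
);

CREATE INDEX articles_category_idx ON articles (category);
CREATE INDEX article_search_idx ON articles USING gin (tsv);

-- Keep tsv in sync with keywords on every insert/update
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
    ON articles FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', keywords);
```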

There are more columns in the table, but I don't think they are crucial to the question. The total size of the table is 94GB, with about 29M rows.

I'm trying to run a keyword search on a subset of 23M of the article rows. To do this I use the following query:

SELECT title, id FROM articles, plainto_tsquery('dog') AS q 
WHERE (tsv @@ q) AND category = 'animal' 
ORDER BY ts_rank_cd(tsv, q) DESC LIMIT 5

The problem is that ts_rank_cd appears to be computed for every matching row before the results can be sorted, so this query is very slow: about 2-3 minutes. I've read around a lot to try and find a solution, and it was suggested that I wrap the search query in another query so that the ranking is only applied to the found results, like so:

SELECT * FROM (
  SELECT title, id, tsv FROM articles, plainto_tsquery('dog') AS q 
  WHERE (tsv @@ q) AND category = 'animal'
) AS t1
ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('dog')) DESC LIMIT 5;

However, because the search term is so short and common, there are 450K matching rows in the subset, so it still takes a long time. It might be a bit quicker, but I need this to be essentially instant.

The question: Is there anything I can do to keep this searching functionality within PostgreSQL?

It's nice having this logic kept in the database, and it means I don't require any extra servers or configuration for something like Solr or Elasticsearch. For example, would increasing the database instance capacity help? Or would the cost efficiency not make sense compared to shifting this logic over to a dedicated Elasticsearch instance?


The EXPLAIN response from the first query is as follows:

Limit  (cost=567539.41..567539.42 rows=5 width=465)
  ->  Sort  (cost=567539.41..567853.33 rows=125568 width=465)
        Sort Key: (ts_rank_cd(articles.tsv, q.q))
        ->  Nested Loop  (cost=1769.27..565453.77 rows=125568 width=465)
              ->  Function Scan on plainto_tsquery q  (cost=0.00..0.01 rows=1 width=32)
              ->  Bitmap Heap Scan on articles  (cost=1769.27..563884.17 rows=125567 width=433)
                    Recheck Cond: (tsv @@ q.q)
                    Filter: ((category)::text = 'animal'::text)
                    ->  Bitmap Index Scan on article_search_idx  (cost=0.00..1737.87 rows=163983 width=0)
                          Index Cond: (tsv @@ q.q)

And for the second query:

Aggregate  (cost=565453.77..565453.78 rows=1 width=0)
  ->  Nested Loop  (cost=1769.27..565139.85 rows=125568 width=0)
        ->  Function Scan on plainto_tsquery q  (cost=0.00..0.01 rows=1 width=32)
        ->  Bitmap Heap Scan on articles  (cost=1769.27..563884.17 rows=125567 width=351)
              Recheck Cond: (tsv @@ q.q)
              Filter: ((category)::text = 'animal'::text)
              ->  Bitmap Index Scan on article_search_idx  (cost=0.00..1737.87 rows=163983 width=0)
                    Index Cond: (tsv @@ q.q)

Upvotes: 16

Views: 4172

Answers (3)

benjist

Reputation: 2881

You simply can't use an index over ts_rank_cd, because the resulting ranking value depends on your query. Therefore all rank values for the whole result set must be computed every time you run a query, before the result set can be sorted and limited by that value.

If your search logic allows, you could avoid this bottleneck by precomputing a relevance value for each record once, creating an index over it, and using that as the sort column instead of computing the cover density for each query.
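A sketch of that approach, assuming a hypothetical `static_rank` column filled by your own query-independent relevance logic (the placeholder formula here is just for illustration):

```sql
-- Hypothetical: compute a query-independent relevance score once per row
ALTER TABLE articles ADD COLUMN static_rank real;

-- Populate it with whatever static signal fits your data;
-- length(tsv) (number of distinct lexemes) is only a placeholder
UPDATE articles SET static_rank = length(tsv);

CREATE INDEX articles_static_rank_idx ON articles (static_rank DESC);

-- Sorting now uses the precomputed, indexed value instead of ts_rank_cd
SELECT title, id
FROM articles, plainto_tsquery('dog') AS q
WHERE tsv @@ q AND category = 'animal'
ORDER BY static_rank DESC
LIMIT 5;
```

The trade-off is that ranking no longer reflects how well each row matches the particular query, only how "important" the row is overall.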

Even though you said you didn't want to, I suggest you look into a search engine that can work together with PostgreSQL, such as Sphinx. The default BM25 ranker should work fine. You can still set column weights as well, if you have to (http://sphinxsearch.com/docs/current.html#api-func-setfieldweights).

Update: This is also stated in the documentation:

"Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches."

See http://www.postgresql.org/docs/8.3/static/textsearch-controls.html

Upvotes: 3

jfalcon

Reputation: 101

You should index the category column, and you could try increasing the working memory for this particular query to avoid the Bitmap Heap Scan, assuming the category filter is not what's slowing it down:

SET LOCAL work_mem = '64MB';

This could increase your memory usage a lot if the query is executed concurrently.
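Note that SET LOCAL only takes effect inside a transaction block; run outside one it emits a warning and does nothing. A minimal sketch, reusing the query from the question:

```sql
BEGIN;
-- Applies only until COMMIT/ROLLBACK, so other queries keep the default
SET LOCAL work_mem = '64MB';

SELECT title, id
FROM articles, plainto_tsquery('dog') AS q
WHERE tsv @@ q AND category = 'animal'
ORDER BY ts_rank_cd(tsv, q) DESC
LIMIT 5;

COMMIT;
```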

Upvotes: 0

user3929561

Reputation: 41

Maybe... Your clause on category might be optimized away if you use a hash index, and your query on tsv might be optimized using a GIN index. If your category is a (rather small) finite set, maybe you should use an enum for category instead of character varying (or at least not varchar). (I wonder if the weights really matter in your case.)

SELECT *
FROM (SELECT sub.*, ts_rank_cd(sub.tsv, plainto_tsquery('dog')) AS rank
      FROM (SELECT title, id, tsv FROM articles WHERE category = 'animal') AS sub) AS t,
     plainto_tsquery('dog') AS q
WHERE (tsv @@ q)
ORDER BY rank DESC LIMIT 5;
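If you do switch category to an enum, the migration might look like this (the category labels here are hypothetical, only 'animal' appears in the question):

```sql
-- Hypothetical enum migration for the category column
CREATE TYPE article_category AS ENUM ('animal', 'plant', 'mineral');

ALTER TABLE articles
    ALTER COLUMN category TYPE article_category
    USING category::article_category;
```

Enum values are stored as 4-byte integers, so comparisons and the category index become cheaper than varchar(255) comparisons.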

Upvotes: 0
