Reputation: 17711
I have a small PostgreSQL database (~~3,000 rows).
I'm trying to set up a full text search on one of it's text fields ('body').
The problem is that any query is extremely slow (35+ seconds!!!).
I suppose the problem comes from the fact that the DB chooses a sequential scan mode...
This is my query:
SELECT
ts_rank_cd(to_tsvector('italian', body), query),
ts_headline('italian', body, to_tsquery('torino')),
title,
location,
id_author
FROM
fulltextsearch.documents, to_tsquery('torino') as query
WHERE
(body_tsvector @@ query)
OFFSET
0
This is the EXPLAIN ANALYZE:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1129.81 rows=19 width=468) (actual time=74.059..13630.114 rows=863 loops=1)
-> Nested Loop (cost=0.00..1129.81 rows=19 width=468) (actual time=74.056..13629.342 rows=863 loops=1)
Join Filter: (documents.body_tsvector @@ query.query)
-> Function Scan on to_tsquery query (cost=0.00..0.01 rows=1 width=32) (actual time=4.606..4.608 rows=1 loops=1)
-> Seq Scan on documents (cost=0.00..1082.09 rows=3809 width=591) (actual time=0.045..48.072 rows=3809 loops=1)
Total runtime: 13630.720 ms
This is my table:
mydb=# \d+ fulltextsearch.documents;
Table "fulltextsearch.documents"
Column | Type | Modifiers | Storage | Description
---------------+-------------------+-----------------------------------------------------------------------+----------+-------------
id | integer | not null default nextval('fulltextsearch.documents_id_seq'::regclass) | plain |
id_author | integer | | plain |
body | character varying | | extended |
title | character varying | | extended |
location | character varying | | extended |
date_creation | date | | plain |
body_tsvector | tsvector | | extended |
Indexes:
"fulltextsearch_documents_tsvector_idx" gin (to_tsvector('italian'::regconfig, COALESCE(body, ''::character varying)::text))
"id_idx" btree (id)
Triggers:
body_tsvectorupdate BEFORE INSERT OR UPDATE ON fulltextsearch.documents FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('body_tsvector', 'pg_catalog.italian', 'body')
Has OIDs: no
I'm sure I'm missing something obvious....
Any clues?
.
.
.
=== UPDATE =======================================================================
Thanks to your suggestions, I came up with this (better) query:
SELECT
ts_rank(body_tsvector, query),
ts_headline('italian', body, query),
title,
location
FROM
fulltextsearch.documents, to_tsquery('italian', 'torino') as query
WHERE
to_tsvector('italian', coalesce(body,'')) @@ query
which is quite better, but always very slow (13+ seconds...).
I notice that commenting out the "ts_headline()" row the query is lightning-fast.
This is the EXPLAIN ANALYZE, which finally uses the index, but doesn't help me much...:
EXPLAIN ANALYZE SELECT
clock_timestamp() - statement_timestamp() as elapsed_time,
ts_rank(body_tsvector, query),
ts_headline('italian', body, query),
title,
location
FROM
fulltextsearch.documents, to_tsquery('italian', 'torino') as query
WHERE
to_tsvector('italian', coalesce(body,'')) @@ query
Nested Loop (cost=16.15..85.04 rows=19 width=605) (actual time=102.290..13392.161 rows=863 loops=1)
-> Function Scan on query (cost=0.00..0.01 rows=1 width=32) (actual time=0.008..0.009 rows=1 loops=1)
-> Bitmap Heap Scan on documents (cost=16.15..84.65 rows=19 width=573) (actual time=0.381..4.236 rows=863 loops=1)
Recheck Cond: (to_tsvector('italian'::regconfig, (COALESCE(body, ''::character varying))::text) @@ query.query)
-> Bitmap Index Scan on fulltextsearch_documents_tsvector_idx (cost=0.00..16.15 rows=19 width=0) (actual time=0.312..0.312 rows=863 loops=1)
Index Cond: (to_tsvector('italian'::regconfig, (COALESCE(body, ''::character varying))::text) @@ query.query)
Total runtime: 13392.717 ms
Upvotes: 6
Views: 5426
Reputation: 432
I solved the problem by precomputing the ts_rank_cd and storing it in a table for popular terms(high occurrences) in the corpus. The search looks at this table to get the sorted doc rank for a query term. if not there (for less popular terms), it will default to creating the ts_rank_cd on the fly.
Please take a look at this post.
https://dba.stackexchange.com/a/149701
Upvotes: 0
Reputation: 17711
At last, with the help of your answers and comments, and with some googling, I did solve by running ts_headline() (a very heavy function, I suppose) on a subset of the full result set (the results page I'm interested in):
SELECT
id,
ts_headline('italian', body, to_tsquery('italian', 'torino')) as headline,
rank,
title,
location
FROM (
SELECT
id,
body,
title,
location,
ts_rank(body_tsvector, query) as rank
FROM
fulltextsearch.documents, to_tsquery('italian', 'torino') as query
WHERE
to_tsvector('italian', coalesce(body,'')) @@ query
LIMIT 10
OFFSET 0
) as s
Upvotes: 2
Reputation: 22893
You're missing two (reasonably obvious) things:
1 You've set 'italian'
in your to_tsvector()
but you aren't specifying it in to_tsquery()
Keep both consistent.
2 You've indexed COALESCE(body, ...)
but that isn't what you're searching against.
The planner isn't magic - you can only use an index if that's what you're searching against.
Upvotes: 5