SRHY15

Reputation: 95

Postgres Full Text Search - Issues with Query Speed

I'm working on incorporating full-text search into my app. In the production version, a user will enter a search phrase which will be searched against 10M+ rows in a table. I'm currently testing it out with a subset of that data (~800k rows) and having some speed issues. When I run this query:

SELECT title, ts_rank_cd(title_abstract_tsvector, to_tsquery('english','cancer'), 4) AS rank
FROM test_search_articles 
WHERE title_abstract_tsvector @@ to_tsquery('cancer') 
ORDER BY rank LIMIT 50

where 'cancer' is the search term, the query takes 25-30 seconds. However, when I change the ORDER BY from rank to id as below:

SELECT title, ts_rank_cd(title_abstract_tsvector, to_tsquery('english','cancer'), 4) AS rank 
FROM test_search_articles 
WHERE title_abstract_tsvector @@ to_tsquery('cancer') 
ORDER BY id LIMIT 50

the query takes <1sec. I'm confused why changing the ORDER BY accounts for such a huge change in query speed, especially given that rank is returned in both. Could anyone help me understand this and what to do to make the original query faster? Not sure if it's relevant, but I'm currently using a GIN index on my tsvector column (title_abstract_tsvector).

EDIT: Running either query without the LIMITs takes 25-30 seconds, which answers my question about why ORDER BY id matters. As for how to speed up the first query, I'm still looking for a solution.

EDIT 2: Create Index statements

CREATE UNIQUE INDEX test_search_articles_pkey ON public.test_search_articles USING btree (id)

CREATE INDEX article_idx ON public.test_search_articles USING gin (title_abstract_tsvector)

Execution Plan

"Gather  (cost=1679.97..177072.34 rows=71706 width=103) (actual time=43.963..28084.129 rows=72111 loops=1)"
"  Workers Planned: 2"
"  Workers Launched: 2"
"  Buffers: shared hit=194957 read=97049"
"  I/O Timings: read=80499.893"
"  ->  Parallel Bitmap Heap Scan on test_search_articles  (cost=679.97..168901.74 rows=29878 width=103) (actual time=15.580..28008.573 rows=24037 loops=3)"
"        Recheck Cond: (title_abstract_tsvector @@ to_tsquery('cancer'::text))"
"        Heap Blocks: exact=16483"
"        Buffers: shared hit=194957 read=97049"
"        I/O Timings: read=80499.893"
"        ->  Bitmap Index Scan on article_idx  (cost=0.00..662.04 rows=71706 width=0) (actual time=27.719..27.720 rows=72111 loops=1)"
"              Index Cond: (title_abstract_tsvector @@ to_tsquery('cancer'::text))"
"              Buffers: shared hit=1 read=20"
"              I/O Timings: read=11.768"
"Planning Time: 12.145 ms"
"Execution Time: 28104.318 ms"

EDIT 3:

select pg_relation_size('test_search_articles'): 2176933888

select pg_table_size('test_search_articles'): 4283850752

average pg_column_size of title_abstract_tsvector over the entire table: 1343.5673777677141794

average pg_column_size of title_abstract_tsvector over the rows matching the 'cancer' query: 1576.1418923603888450

EDIT 4: Vacuum output message:

INFO: vacuuming "public.test_search_articles"

INFO: "test_search_articles": found 0 removable, 1003125 nonremovable row versions in 265739 pages

DETAIL: 0 dead row versions cannot be removed yet.

CPU: user: 31.15 s, system: 10.38 s, elapsed: 126.14 s.

INFO: analyzing "public.test_search_articles"

INFO: "test_search_articles": scanned 30000 of 161999 pages, containing 185588 live rows and 0 dead rows; 30000 rows in sample, 1002169 estimated total rows

VACUUM

Upvotes: 2

Views: 5292

Answers (2)

bobflux

Reputation: 11591

Quite often the slow search query is useless, which means optimizing it is a waste of time.

If there are 24037 records matching "cancer", then searching on this single term will never return relevant results to the user, so it is pointless to sort by relevance. What could be more useful is a set of heuristics, for example: if the user enters only one search term, display the most recent articles about that term (fast) and maybe offer a list of keywords often related to it; then switch to "ORDER BY rank" only when the user enters enough search keywords to produce a meaningful rank (see the sketch below). This way you can implement a search that is not just faster, but also more useful.
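A minimal sketch of that branching, assuming a hypothetical published_date column and that the application counts the search terms before choosing a query (the column and the multi-term example are assumptions, not from the question):

-- Single search term: skip ranking and show the newest matching articles.
-- published_date is a hypothetical column used only for illustration.
SELECT title
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('english', 'cancer')
ORDER BY published_date DESC
LIMIT 50;

-- Several search terms: the rank is now meaningful, so sort by it.
SELECT title,
       ts_rank_cd(title_abstract_tsvector,
                  to_tsquery('english', 'cancer & gene & therapy'), 4) AS rank
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('english', 'cancer & gene & therapy')
ORDER BY rank DESC
LIMIT 50;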

Maybe you will say, "but if I type a single word into google I get relevant results!" and... yes, of course, but that's because google knows everything you do, it always has context, and if you don't enter extra search terms, it will do it for you.

Upvotes: 6

jjanes

Reputation: 44373

Essentially all your time is spent on IO. So the main thing you can do is get faster IO, or more RAM so you can cache more of the data.

The fact that your buffers read count is about 6 times your exact heap blocks count suggests that your title_abstract_tsvector is so large that it has been TOASTed and now needs to be reassembled from multiple pages in order to be used in the computation of the rank function. Is that plausible? How large is that column on average?
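If you haven't measured it yet, the average can be checked directly with pg_column_size (a generic sizing query, nothing specific to your setup):

-- Average stored size (in bytes) of the tsvector column, for the whole table
-- and for the rows matched by the slow query.
SELECT avg(pg_column_size(title_abstract_tsvector)) AS avg_bytes_all
FROM test_search_articles;

SELECT avg(pg_column_size(title_abstract_tsvector)) AS avg_bytes_cancer
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('english', 'cancer');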

Are you already saturating your disk capacity? If not, you could try to get a larger degree of parallelization by, for example, increasing max_parallel_workers_per_gather.
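For example (the value 4 is arbitrary and still has to fit within max_parallel_workers and max_worker_processes):

-- Allow more parallel workers for this session, then re-test the query.
SET max_parallel_workers_per_gather = 4;
EXPLAIN (ANALYZE, BUFFERS)
SELECT title, ts_rank_cd(title_abstract_tsvector, to_tsquery('english', 'cancer'), 4) AS rank
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('cancer')
ORDER BY rank LIMIT 50;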

But the main thing you can do is just not run that query very much. Either don't let users run such a non-specific query, or precompute and store the results of single-term queries so they can be returned without recomputing them.
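One possible shape for the precomputation, assuming a hypothetical cache table keyed by the search term (the table name, column types, and refresh strategy are all assumptions):

-- Hypothetical cache of the top 50 results per single-word search.
CREATE TABLE single_term_results (
    term       text   NOT NULL,
    article_id bigint NOT NULL,
    title      text,
    rank       real,
    PRIMARY KEY (term, article_id)
);

-- Populate the cache for one term, e.g. 'cancer'.
INSERT INTO single_term_results (term, article_id, title, rank)
SELECT 'cancer', id, title,
       ts_rank_cd(title_abstract_tsvector, to_tsquery('english', 'cancer'), 4)
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('english', 'cancer')
ORDER BY 4 DESC
LIMIT 50;

-- Serving a single-term search then becomes a cheap lookup.
SELECT title, rank
FROM single_term_results
WHERE term = 'cancer'
ORDER BY rank DESC;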

Upvotes: 2
