Tomer
Tomer

Reputation: 630

Postgres Text Search with Additional Words/Tokens

I have a table with English sentences. Given a sentence which may contain an additional word, or a distorted word, can I find the closest sentence in the table using Postgres' Text-Search capabilities?

to_tsvector('a b c') @@ plainto_tsquery('a b') returns true

to_tsvector('a b') @@ plainto_tsquery('a b c') returns false

I would like scenario 2 to return true as well.

Notes:

  1. The length of the sentences may be dozens of words. I'm looking for an efficient solution..
  2. Other text search engines such as Elastic/Solr will successfully return the closest result.

More information regrading performance of the trigram index.

EXPLAIN (ANALYSE, BUFFERS)
SELECT
    similarity(title, 'electode paste composition') as sml,
    title
FROM
    table
WHERE
    title % 'electode paste composition'
ORDER BY
    sml DESC;

returns:

Gather Merge  (cost=1880112.22..1902381.94 rows=190870 width=93) (actual time=36355.303..36356.143 rows=5 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=407649
  ->  Sort  (cost=1879112.20..1879350.78 rows=95435 width=93) (actual time=36344.180..36344.180 rows=2 loops=3)
        Sort Key: (similarity(title, 'electode paste composition'::text)) DESC"
        Sort Method: quicksort  Memory: 25kB
        Worker 0:  Sort Method: quicksort  Memory: 25kB
        Worker 1:  Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=407649
        ->  Parallel Bitmap Heap Scan on table  (cost=2759.10..1866325.66 rows=95435 width=93) (actual time=35940.284..36344.141 rows=2 loops=3)
              Recheck Cond: (title % 'electode paste composition'::text)"
              Rows Removed by Index Recheck: 14904
              Heap Blocks: exact=16199
              Buffers: shared hit=407635
              ->  Bitmap Index Scan on title_trgm  (cost=0.00..2701.84 rows=229045 width=0) (actual time=35543.907..35543.907 rows=44716 loops=1)
                    Index Cond: (title % 'electode paste composition'::text)"
                    Buffers: shared hit=362988
Planning Time: 0.084 ms
Execution Time: 36356.187 ms

The same query using a tsquery takes less than 2.5 seconds.

Upvotes: 0

Views: 290

Answers (1)

Laurenz Albe
Laurenz Albe

Reputation: 246093

Assuming that your words are longer than one character, I'd recommend trigram indexes:

CREATE EXTENSION pg_trgm;

CREATE INDEX ON atable USING gin (textcol gin_trgm_ops);

SELECT * FROM atable WHERE textcol % 'search string';

% is the similarity operator.

Upvotes: 1

Related Questions