Reputation: 141
I want to implement full-text search in PostgreSQL combined with a kind of fuzzy search. For my test area I followed this article: https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/ and everything is working fine. But sometimes I have search cases with and without spaces in the search string, like this:
In my 'title' column there is an entry like 'test123'. My search string looks like 'test 123', with a space in it. How can I get a hit in this test case?
My search SQL query looks like:
SELECT *
FROM test, plainto_tsquery('test:*&123:*') as q
WHERE (tsv @@ q)
result: 0 rows
So I tried to figure out whether I can use pg_trgm combined with tsvector, but I cannot find a solution. Do you have an idea?
Upvotes: 11
Views: 9950
Reputation: 41
Step 1: say you have an input query; start by finding similar words in your documents, as described in the pg_trgm extension documentation (see the sketch after these steps). I usually use a custom text search config for building this words table, without stemming but with a language stop-word filter. With just the simple config, you'll match stop words.
Step 2: rewrite the query with the similar words instead of the direct input. Run the text search query with the similar words you found.
Step 3: now comes the tricky part: you have to rank/order the results. Keep the original query and redo the query on those results. You could put the records matching with the "followed by" (<->) operator first, after that the "and", then the "or", and finally those matching none of these (the similar but different words), and order by the built-in rank or by similarity. You could even have ranking logic involving other fields, like popularity, freshness, or other business-related criteria.
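A minimal sketch of steps 1 and 2, following the words-table approach from the pg_trgm documentation (the docs table, body column, and the 'serch' input token are illustrative assumptions, not from the question):
-- Step 1 (sketch): collect every distinct word in the corpus into a
-- lookup table, as the pg_trgm docs describe, and index it for
-- trigram matching.
CREATE TABLE words AS
  SELECT word
  FROM ts_stat('SELECT to_tsvector(''simple'', body) FROM docs');

CREATE INDEX words_trgm_idx ON words USING gin (word gin_trgm_ops);

-- Find the closest known words for a (possibly misspelled) input token:
SELECT word, similarity(word, 'serch') AS sml
FROM words
WHERE word % 'serch'   -- trigram match, can use the index
ORDER BY sml DESC
LIMIT 5;

-- Step 2 (sketch): feed the corrected word(s) into the text search
-- query, e.g. ... WHERE tsv @@ to_tsquery('simple', 'search').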
You have full control, but you'll have to sanitize/rewrite user inputs, and you'll have to play around with the different parameters/text search configs.
You can also have multi-language support with vectors/indexes in each language config.
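For instance (a sketch; the docs/body names are again assumptions), one expression index per language config, where queries must use the same expression to hit the matching index:
CREATE INDEX docs_tsv_en ON docs USING gin (to_tsvector('english', body));
CREATE INDEX docs_tsv_de ON docs USING gin (to_tsvector('german', body));

-- e.g. SELECT * FROM docs
-- WHERE to_tsvector('german', body) @@ to_tsquery('german', 'häuser');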
You could use materialized views for the tables containing the vectorized docs and build indexes on those views. This way queries will be fast and you won't have insert performance issues with the original table. You can use a trigger to refresh the views concurrently on insert, update, or delete.
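A rough sketch of that setup (same assumed docs table; note that a statement-level trigger refreshing the whole view on every write can get expensive on busy tables):
-- Materialized view holding the pre-vectorized documents
CREATE MATERIALIZED VIEW docs_search AS
  SELECT id, to_tsvector('english', body) AS tsv
  FROM docs;

-- A unique index is required for REFRESH ... CONCURRENTLY
CREATE UNIQUE INDEX docs_search_id_idx ON docs_search (id);
CREATE INDEX docs_search_tsv_idx ON docs_search USING gin (tsv);

-- Refresh the view after writes to the base table
CREATE FUNCTION refresh_docs_search() RETURNS trigger AS $$
BEGIN
  REFRESH MATERIALIZED VIEW CONCURRENTLY docs_search;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER docs_search_refresh
AFTER INSERT OR UPDATE OR DELETE ON docs
FOR EACH STATEMENT EXECUTE FUNCTION refresh_docs_search();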
Upvotes: 0
Reputation: 10335
This is a bit tricky with postgres, but absolutely possible to do in a performant way.
What we'd like to achieve is to be able to search for a series of terms, not necessarily in order and perhaps misspelled, that will match words in the document.
We want to get the document that has the closest match to the terms we've specified.
For this, I recommend using trigrams via pg_trgm.
What we'll do is create a GiST (or GIN) index on the column we want to search; note the siglen parameter below applies to the GiST opclass:
create index idx_universal_search on my_table using gist(search_column gist_trgm_ops(siglen=256));
The tricky part is the query. We want to take the input, chop it into a set of tokens, for each token we want to match on our index and get a score. Then we want to aggregate the sum of those scores to get the best overall match. Here's an example of how to do that:
with tokens as
  (select unnest(string_to_array('some search phrase', ' ')) as t)
select
  id,
  search_column,
  sum(1.0 - (tokens.t <<<-> search_column)) as score
from my_table, tokens
where tokens.t <<% search_column
group by id, search_column
order by score desc
limit 10;
In that example, I'm using a CTE to split the input phrase into individual tokens. Then I use a cross join against the search table and sum the individual token scores using the strict_word_similarity operators (to be able to use the index).
The 1.0 - (tokens.t <<<-> search_column) bit turns the distance into a score, so a greater value means a better match: the strict_word_similarity distance operator <<<-> returns the distance between the words, where a lower value means a closer match.
It's not google, but it gives pretty good results.
Upvotes: 2
Reputation: 8272
As the documentation on parsing states:
...plainto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input...
plainto_tsquery and phraseto_tsquery are convenience functions which make it easier to search by a full string, but they don't support all of the features. Use to_tsquery instead, which accepts the full search syntax:
SELECT *
FROM test, to_tsquery('testing:* & 123:*') as q
WHERE (tsv @@ q)
This function also requires you to normalize the search query in the same way you normalized the text you're searching with to_tsvector, but that's pretty easy with some string functions:
SELECT string_agg(lexeme || ':*', ' & ' order by positions)
FROM unnest(to_tsvector('testing 123'))
This basically gets the individual tokens from to_tsvector, appends :* to each, then joins them with & to create a single string. The example above takes testing 123 and produces testing:* & 123:*, which you can then use directly with to_tsquery to get fuzzy matching with the normalization intact.
You can combine it all together into a CTE to make it simple:
WITH search AS (
SELECT to_tsquery(string_agg(lexeme || ':*', ' & ' order by positions)) AS query
FROM unnest(to_tsvector('enter your search query here'))
)
SELECT test.*
FROM test, search
WHERE (test.tsv @@ search.query)
This assumes that the table has a tsv column of datatype tsvector which is pregenerated, instead of creating it on every query (which is much slower). PG12+ supports generated columns, which can keep this updated automatically.
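On PG12+ that might look like this (a sketch assuming the title column from the question holds the searchable text):
-- Generated column keeps tsv in sync with title automatically
ALTER TABLE test
  ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', title)) STORED;

-- GIN index so the @@ match can use an index scan
CREATE INDEX test_tsv_idx ON test USING gin (tsv);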
Upvotes: 16