Diego Alves
Diego Alves

Reputation: 2667

A good way to search for similar words in PostgreSQL

I am using PostgreSQL 10 and, even though I have some experience writing queries, I don't now that much about DBMSs. I want that my query be more "intelligent" and not just match exact words. I made a humble research, and it seems that to accomplish this we have many features. First, I thought about using RegEx on the query, but it seems not a good idea, because RegEx are fairly limited. I am sure that Postgres has some better feature more suitable for this.

To implement my smart search I came up with an idea like this: The user can forget a letter in each word or have repeated letters that will not cause a mismatch. Also letters like ç and c should match. This idea is not the perfect reproduction of the problem that I have to solve I am only exposing my problem.

For example:

If a user searches for "macs vermelhas". I should return results containing "macas vermelhas", "maças vermelhas", "macs vermelhas", "macs vrmelhas".

The system is in Portuguese and I have pretty much the same search challenges of Spanish language (ê, ü, ã and etc).

Upvotes: 1

Views: 3225

Answers (1)

Laurenz Albe
Laurenz Albe

Reputation: 246443

I think that trigram indexes are a perfect fit for you.

CREATE EXTENSION pg_trgm;

Let's have a table like that:

CREATE TABLE words (word text);

INSERT INTO words VALUES ('maças vermelhas');
INSERT INTO words VALUES ('alguna cosa');

CREATE INDEX ON words USING gin (word gin_trgm_ops);

Now you can find words by similarity:

SELECT * FROM words WHERE word % 'macs vermelhas';

      word       
-----------------
 maças vermelhas
(1 row)

The index can be used (I force it in this case):

SET enable_seqscan=off;

EXPLAIN (COSTS OFF)
   SELECT * FROM words WHERE word % 'macs vermelhas';

                     QUERY PLAN                      
-----------------------------------------------------
 Bitmap Heap Scan on words
   Recheck Cond: (word % 'macs vermelhas'::text)
   ->  Bitmap Index Scan on words_word_idx
         Index Cond: (word % 'macs vermelhas'::text)
(4 rows)

Upvotes: 2

Related Questions