David Senkus
David Senkus

Reputation: 280

Postgres Full Text Search using TSVECTOR and special UTF8 characters

I have a database of instagram posts which needs to be searchable. Current implementation uses TSVECTOR column and @@ operator to search against it:

CREATE TABLE ig_posts (
 /* ... */
 caption_tsv TSVECTOR
);

Posts insert SQL:

INSERT INTO ig_posts(caption_tsv)
VALUES (to_tsvector('simple', ?)) /* simple, because post can be in any language */

Search SQL:

SELECT *
FROM ig_posts
WHERE ig_posts.caption_tsv @@ to_tsquery('simple', ?)

Everything works great, except some posts might contain utf8 encoded bold/italic text, for example:

'rosegold':26 'sunshine':23 '𝑨𝒍𝒘𝒂𝒚𝒔':1 '𝑯𝒆𝒂𝒓𝒕':4 '𝒊𝒏':2 '𝒎𝒚':3 '𝗗𝗢𝗖𝗘𝗡𝗔𝟭𝟱':21 '𝗗𝗮𝘆':17 '𝗠𝗼𝘁𝗵𝗲𝗿':15

This causes the @@ operator to fail, since the "Always" and "𝑨𝒍𝒘𝒂𝒚𝒔" are considered different words. Is it possible to covert bold/italic UTF8 text back to normal within Postgres?


Edit: the accepted answer works perfectly, here is the rules file that I used: https://gist.github.com/dsenkus/9250134b338b9862585963dfdf450e13

And the script that was used to generate it: https://gist.github.com/dsenkus/17ddbd29d53bd6dd2cf4051bd05ad44d (generated results requires some cleanup, since some special letters do not exist in UTF8, and 3rd h letter is not recognized, ℎ should be used as value)

Upvotes: 2

Views: 2549

Answers (1)

Laurenz Albe
Laurenz Albe

Reputation: 246298

You need the unaccent contrib module:

CREATE EXTENSION unaccent;

Then you need to create a file my_unaccent.rules in the tsearch_data subdirectory of the PostgreSQL share directory that contains the character-wise mapping you need. This is something you have to build yourself.

Then you can create a dictionary using it:

CREATE TEXT SEARCH DICTIONARY my_unaccent (TEMPLATE = unaccent, RULES = 'my_unaccent');

Then you can create a new text search configuration based on simple that uses that dictionary:

CREATE TEXT SEARCH CONFIGURATION my_unaccent (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION my_unaccent
   ALTER MAPPING FOR asciihword, asciiword, hword, hword_asciipart, hword_part, word
      WITH my_unaccent, simple;

This full text search configuration should so what you need.

Upvotes: 5

Related Questions