Reputation: 280
I have a database of instagram posts which needs to be searchable. Current implementation uses TSVECTOR
column and @@
operator to search against it:
CREATE TABLE ig_posts (
/* ... */
caption_tsv TSVECTOR
);
Posts insert SQL:
INSERT INTO ig_posts(caption_tsv)
VALUES (to_tsvector('simple', ?)) /* simple, because post can be in any language */
Search SQL:
SELECT *
FROM ig_posts
WHERE ig_posts.caption_tsv @@ to_tsquery('simple', ?)
Everything works great, except some posts might contain utf8 encoded bold/italic text, for example:
'rosegold':26 'sunshine':23 '𝑨𝒍𝒘𝒂𝒚𝒔':1 '𝑯𝒆𝒂𝒓𝒕':4 '𝒊𝒏':2 '𝒎𝒚':3 '𝗗𝗢𝗖𝗘𝗡𝗔𝟭𝟱':21 '𝗗𝗮𝘆':17 '𝗠𝗼𝘁𝗵𝗲𝗿':15
This causes the @@
operator to fail, since the "Always" and "𝑨𝒍𝒘𝒂𝒚𝒔" are considered different words. Is it possible to covert bold/italic UTF8 text back to normal within Postgres?
Edit: the accepted answer works perfectly, here is the rules file that I used: https://gist.github.com/dsenkus/9250134b338b9862585963dfdf450e13
And the script that was used to generate it: https://gist.github.com/dsenkus/17ddbd29d53bd6dd2cf4051bd05ad44d (generated results requires some cleanup, since some special letters do not exist in UTF8, and 3rd h letter is not recognized, ℎ should be used as value)
Upvotes: 2
Views: 2549
Reputation: 246298
You need the unaccent contrib module:
CREATE EXTENSION unaccent;
Then you need to create a file my_unaccent.rules
in the tsearch_data
subdirectory of the PostgreSQL share directory that contains the character-wise mapping you need. This is something you have to build yourself.
Then you can create a dictionary using it:
CREATE TEXT SEARCH DICTIONARY my_unaccent (TEMPLATE = unaccent, RULES = 'my_unaccent');
Then you can create a new text search configuration based on simple
that uses that dictionary:
CREATE TEXT SEARCH CONFIGURATION my_unaccent (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION my_unaccent
ALTER MAPPING FOR asciihword, asciiword, hword, hword_asciipart, hword_part, word
WITH my_unaccent, simple;
This full text search configuration should so what you need.
Upvotes: 5