Reputation: 32104
:) I'm trying to find strings similarity in PostgreSQL 10.1.
I use the extensions unaccent
and pg_trgm
and enabled them with:
create extension unaccent;
create extension pg_trgm;
The problem is that pg_trgm
does not provide utf8 support. So if I execute:
select similarity('כפיר','כפיר');
it returns similarity of zero.
I decided to convert that string to an iso-8859-8
encoding which supports Hebrew and English, the languages that I use in this case.
But first I want to unaccent
the string, so if I have כפיר - ƒ
it will first be converted to כפיר - f
so it will be properly converted to the desired character encoding.
So select unaccent('כפיר - ƒ');
does return the proper results.
So unfortunately executing:
select convert(unaccent('כפיר - ƒ'),'UTF8','ISO_8859_8');
returns
[42883] ERROR: function convert(text, unknown, unknown) does not exist Hint: No function matches the given name and argument types. You might need to add explicit type casts. Position: 8
When I check the documentation at https://www.postgresql.org/docs/9.1/functions-string.html regarding convert()
function it requires a string.
If I try to cast something as string, I get that type "string" does not exist
OK. So working with unaccent is the first issue that I need to resolve (and maybe the only issue). What I did next returned proper results without me seeing the proper string in hebrew.
What I mean is this..
Executing select convert('כפיר','UTF-8','ISO_8859_8');
returns
4 B 00000000 EB F4 E9 F8 ëôéø
and casting it to text with select convert('כפיר','UTF8','ISO_8859_8')::text;
returns \xebf4e9f8
I'm guessing these are the write characters and because of the change in encoding I don't see the actual Hebrew characters. am I wrong to assume that ? that's the 2nd issue.
So if I don't use accent and execute:
select similarity(convert('כפייר עזר','UTF8','ISO_8859_8')::text, convert('כפיר','UTF8','ISO_8859_8')::text);
it returns a similarity of 0.5 which is OK.
To sum up, my questions are:
unaccent
properly to be able to use it in convert
regexp_replace('string', '\W+', '', 'g')
. Is that the way to go? Is there a regexp_replace that supports utf8 ?Thank You!
Upvotes: 1
Views: 1627
Reputation: 32104
ok so the solution was simple.
needed to convert it to byte-array (bytea
) and then back to text.
and regarding the regular expression replace, I should have used the specific characters that I wanted to remove instead of using \W+
.
so the solution for my case is:
select
similarity(convert(unaccent(regexp_replace(lower('string'), '[.,''׳`"-]', '', 'g'))::bytea,'UTF8','ISO_8859_8')::text,
convert(unaccent(regexp_replace(lower('string'), '[.,''׳`"-]', '', 'g'))::bytea,'UTF8','ISO_8859_8')::text)
Upvotes: 1