ufk
ufk

Reputation: 32104

utf-8 string similarity in postgresql 10.1

:) I'm trying to find strings similarity in PostgreSQL 10.1.

I use the extensions unaccent and pg_trgm and enabled them with:

create extension unaccent;
create extension pg_trgm;

The problem is that pg_trgm does not provide utf8 support. So if I execute:

select similarity('כפיר','כפיר');

it returns similarity of zero.

I decided to convert that string to an iso-8859-8 encoding which supports Hebrew and English, the languages that I use in this case.

But first I want to unaccent the string, so if I have כפיר - ƒ it will first be converted to כפיר - f so it will be properly converted to the desired character encoding.

So select unaccent('כפיר - ƒ'); does return the proper results.

So unfortunately executing:

select convert(unaccent('כפיר - ƒ'),'UTF8','ISO_8859_8');

returns

[42883] ERROR: function convert(text, unknown, unknown) does not exist Hint: No function matches the given name and argument types. You might need to add explicit type casts. Position: 8

When I check the documentation at https://www.postgresql.org/docs/9.1/functions-string.html regarding convert() function it requires a string.

If I try to cast something as string, I get that type "string" does not exist

OK. So working with unaccent is the first issue that I need to resolve (and maybe the only issue). What I did next returned proper results without me seeing the proper string in hebrew.

What I mean is this..

Executing select convert('כפיר','UTF-8','ISO_8859_8'); returns

4 B 00000000  EB F4 E9 F8                                        ëôéø

and casting it to text with select convert('כפיר','UTF8','ISO_8859_8')::text; returns \xebf4e9f8

I'm guessing these are the write characters and because of the change in encoding I don't see the actual Hebrew characters. am I wrong to assume that ? that's the 2nd issue.

So if I don't use accent and execute:

select similarity(convert('כפייר עזר','UTF8','ISO_8859_8')::text, convert('כפיר','UTF8','ISO_8859_8')::text);

it returns a similarity of 0.5 which is OK.

To sum up, my questions are:

  1. How to cast unaccent properly to be able to use it in convert
  2. Am I casting my utf-8 Hebrew string properly to ASCII?
  3. Am I trying to resolve the issue the right way - is there a different extension for similarity function that supports utf-8?
  4. I also want to remove any character that is not alphanumeric from the string before I do execute the similarity function in order to get better similarity results based on my needs. I think about using regular expressions about converting the string to ASCII. something like: regexp_replace('string', '\W+', '', 'g'). Is that the way to go? Is there a regexp_replace that supports utf8 ?

Thank You!

Upvotes: 1

Views: 1627

Answers (1)

ufk
ufk

Reputation: 32104

ok so the solution was simple.

needed to convert it to byte-array (bytea) and then back to text. and regarding the regular expression replace, I should have used the specific characters that I wanted to remove instead of using \W+.

so the solution for my case is:

   select
  similarity(convert(unaccent(regexp_replace(lower('string'), '[.,''׳`"-]', '', 'g'))::bytea,'UTF8','ISO_8859_8')::text,
             convert(unaccent(regexp_replace(lower('string'), '[.,''׳`"-]', '', 'g'))::bytea,'UTF8','ISO_8859_8')::text)

Upvotes: 1

Related Questions