binaryLV

Reputation: 9122

comparing strings in PostgreSQL

Is there any way in PostgreSQL to convert UTF-8 characters to "similar" ASCII characters?

For example, the string glāžšķūņu rūķīši would have to be converted to glazskunu rukisi. The UTF-8 text is not in any one specific language; it might be Latvian, Russian, English, Italian, or any other language.

This is needed for use in a WHERE clause, so it might be a matter of "comparing strings" rather than "converting strings".

I tried using convert, but it does not give the desired result (e.g., select convert('Ā', 'utf8', 'sql_ascii') gives \304\200, not A).
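A quick check outside the database suggests where \304\200 comes from: those look like the two UTF-8 bytes of Ā written as octal escapes, so the bytes are passed through unchanged rather than mapped to a similar letter. A minimal Python sketch of that reading (the variable names are only for illustration):

```python
# 'Ā' is U+0100, which UTF-8 encodes as the two bytes 0xC4 0x80.
encoded = "Ā".encode("utf-8")

# Render each byte as a three-digit octal escape, the way the
# query result above displays non-ASCII bytes.
octal_escapes = "".join("\\{:03o}".format(b) for b in encoded)

print(octal_escapes)  # \304\200
```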

Database is created with:

ENCODING = 'UTF8'
LC_COLLATE = 'Latvian_Latvia.1257'
LC_CTYPE = 'Latvian_Latvia.1257'

These params may be changed, if necessary.

Upvotes: 2

Views: 4111

Answers (2)

J-16 SDiZ

Reputation: 26910

Use pg_collkey() for ICU-supported Unicode comparison:

- http://www.public-software-group.org/pg_collkey
- http://russ.garrett.co.uk/tag/postgresql/

Upvotes: 2

analogue

Reputation: 3190

I found different ways to do this on the PostgreSQL Wiki.

In plperl:

CREATE OR REPLACE FUNCTION unaccent_string(text) RETURNS text AS $$
my ($input_string) = @_;
$input_string =~ s/[âãäåāăą]/a/g;
$input_string =~ s/[ÁÂÃÄÅĀĂĄ]/A/g;
$input_string =~ s/[èééêëēĕėęě]/e/g;
$input_string =~ s/[ĒĔĖĘĚ]/E/g;
$input_string =~ s/[ìíîïìĩīĭ]/i/g;
$input_string =~ s/[ÌÍÎÏÌĨĪĬ]/I/g;
$input_string =~ s/[óôõöōŏő]/o/g;
$input_string =~ s/[ÒÓÔÕÖŌŎŐ]/O/g;
$input_string =~ s/[ùúûüũūŭů]/u/g;
$input_string =~ s/[ÙÚÛÜŨŪŬŮ]/U/g;
return $input_string;
$$ LANGUAGE plperl;

In pure SQL:

CREATE OR REPLACE FUNCTION unaccent_string(text)
RETURNS text
IMMUTABLE
STRICT
LANGUAGE SQL
AS $$
SELECT translate(
    $1,
    'âãäåāăąÁÂÃÄÅĀĂĄèééêëēĕėęěĒĔĖĘĚìíîïìĩīĭÌÍÎÏÌĨĪĬóôõöōŏőÒÓÔÕÖŌŎŐùúûüũūŭůÙÚÛÜŨŪŬŮ',
    'aaaaaaaAAAAAAAAeeeeeeeeeeEEEEEiiiiiiiiIIIIIIIIoooooooOOOOOOOOuuuuuuuuUUUUUUUU'
);
$$;

And in plpython:

create or replace function unaccent(text) returns text language plpythonu as $$
import unicodedata
rv = plpy.execute("select setting from pg_settings where name = 'server_encoding'");
encoding = rv[0]["setting"]
s = args[0].decode(encoding)
s = unicodedata.normalize("NFKD", s)
s = ''.join(c for c in s if ord(c) < 127)
return s
$$;
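To try the normalization idea outside the database, here is a minimal standalone Python 3 sketch. to_ascii follows the same approach as the PL/Python function above (decompose with NFKD, keep only ASCII). strip_marks is a hypothetical variant, not taken from the wiki, that removes only combining marks; it is included because the ASCII filter drops non-Latin text such as Russian entirely.

```python
import unicodedata


def to_ascii(text):
    """Decompose with NFKD, then keep only ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if ord(c) < 128)


def strip_marks(text):
    """Hypothetical variant: drop combining marks, keep everything else."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(to_ascii("glāžšķūņu rūķīši"))  # glazskunu rukisi
print(strip_marks("привет"))         # привет
```

Note that to_ascii("привет") returns an empty string, so for mixed-language data the mark-stripping variant may be the safer starting point.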

In your case, a translate() call with all the characters you can find in the UTF-8 table should be enough.
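Typing out every accented character by hand is error-prone, so one option is to generate the two translate() arguments. The sketch below is only an assumption about how that could be done: build_translate_args is a hypothetical helper, and the default range U+00C0 to U+017F (Latin-1 Supplement and Latin Extended-A) is an arbitrary choice that covers the Latvian letters in the question. It keeps only characters whose NFKD form is a single ASCII letter followed by combining marks.

```python
import unicodedata


def build_translate_args(first=0x00C0, last=0x017F):
    """Return (from_chars, to_chars) suitable for SQL translate()."""
    src, dst = [], []
    for code_point in range(first, last + 1):
        ch = chr(code_point)
        decomposed = unicodedata.normalize("NFKD", ch)
        base, marks = decomposed[0], decomposed[1:]
        # Keep only "ASCII letter + combining marks", e.g. ā -> a.
        # Characters such as ß or æ have no such decomposition and are skipped.
        if (marks and ord(base) < 128 and base.isalpha()
                and all(unicodedata.combining(m) for m in marks)):
            src.append(ch)
            dst.append(base)
    return "".join(src), "".join(dst)


from_chars, to_chars = build_translate_args()
# Both strings have the same length, as translate() expects for a 1:1 mapping.
print("SELECT translate($1, '{}', '{}');".format(from_chars, to_chars))
```

The printed statement can be pasted into the body of the pure SQL function; widen the range if other scripts or blocks are needed.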

Upvotes: 3
