Reputation: 9122
Is there any way in PostgreSQL to convert UTF-8 characters to "similar" ASCII characters?
The string glāžšķūņu rūķīši would have to be converted to glazskunu rukisi. The UTF-8 text is not in one specific language; it might be in Latvian, Russian, English, Italian or any other language.
This is needed for use in a where clause, so it might be just "comparing strings" rather than "converting strings".
I tried using convert, but it does not give the desired result (e.g., select convert('Ā', 'utf8', 'sql_ascii') gives \304\200, not A).
Database is created with:
ENCODING = 'UTF8'
LC_COLLATE = 'Latvian_Latvia.1257'
LC_CTYPE = 'Latvian_Latvia.1257'
These params may be changed, if necessary.
Upvotes: 2
Views: 4111
Reputation: 26910
Use pg_collkey() for ICU-supported Unicode comparison:
- http://www.public-software-group.org/pg_collkey
- http://russ.garrett.co.uk/tag/postgresql/
Upvotes: 2
Reputation: 3190
I found different ways to do this on the PostgreSQL Wiki.
In plperl:
CREATE OR REPLACE FUNCTION unaccent_string(text) RETURNS text AS $$
my ($input_string) = @_;
# Each substitution needs a closing delimiter, plus the g flag to replace every match.
$input_string =~ s/[âãäåāăą]/a/g;
$input_string =~ s/[ÁÂÃÄÅĀĂĄ]/A/g;
$input_string =~ s/[èééêëēĕėęě]/e/g;
$input_string =~ s/[ĒĔĖĘĚ]/E/g;
$input_string =~ s/[ìíîïìĩīĭ]/i/g;
$input_string =~ s/[ÌÍÎÏÌĨĪĬ]/I/g;
$input_string =~ s/[óôõöōŏő]/o/g;
$input_string =~ s/[ÒÓÔÕÖŌŎŐ]/O/g;
$input_string =~ s/[ùúûüũūŭů]/u/g;
$input_string =~ s/[ÙÚÛÜŨŪŬŮ]/U/g;
return $input_string;
$$ LANGUAGE plperl;
In pure SQL:
CREATE OR REPLACE FUNCTION unaccent_string(text)
RETURNS text
IMMUTABLE
STRICT
LANGUAGE SQL
AS $$
SELECT translate(
$1,
'âãäåāăąÁÂÃÄÅĀĂĄèééêëēĕėęěĒĔĖĘĚìíîïìĩīĭÌÍÎÏÌĨĪĬóôõöōŏőÒÓÔÕÖŌŎŐùúûüũūŭůÙÚÛÜŨŪŬŮ',
'aaaaaaaAAAAAAAAeeeeeeeeeeEEEEEiiiiiiiiIIIIIIIIoooooooOOOOOOOOuuuuuuuuUUUUUUUU'
);
$$;
And in plpython:
create or replace function unaccent(text) returns text language plpythonu as $$
import unicodedata
rv = plpy.execute("select setting from pg_settings where name = 'server_encoding'");
encoding = rv[0]["setting"]
s = args[0].decode(encoding)
s = unicodedata.normalize("NFKD", s)
s = ''.join(c for c in s if ord(c) < 127)
return s
$$;
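For reference, the same decompose-and-filter idea can be tried outside the database. This is only a sketch in Python 3 (the plpythonu function above targets Python 2); the function name to_ascii_like is made up for illustration. Note that characters with no ASCII decomposition, such as Cyrillic letters, are dropped rather than transliterated.

```python
import unicodedata


def to_ascii_like(text: str) -> str:
    """Hypothetical helper: decompose characters, then keep only ASCII."""
    # NFKD splits "ā" into "a" followed by a combining macron (U+0304).
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks and any other non-ASCII characters are discarded.
    return "".join(c for c in decomposed if ord(c) < 128)


print(to_ascii_like("glāžšķūņu rūķīši"))  # prints: glazskunu rukisi
```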
In your case, a translate() call with all the characters you can find in the UTF-8 table should be enough.
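If typing the two translate() arguments by hand is too error-prone, they could be generated. The following is a rough sketch, not a tested recipe: the function name build_translate_args and the chosen code point range (U+00C0 to U+024F, the Latin supplement and extended blocks) are my assumptions. It keeps only characters whose decomposition leaves exactly one ASCII letter, so both strings always have the same length.

```python
import unicodedata


def build_translate_args(first=0x00C0, last=0x024F):
    """Hypothetical helper: build the 'from' and 'to' strings for translate()."""
    src, dst = [], []
    for cp in range(first, last + 1):
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFKD", ch)
        # Keep only the ASCII part of the decomposition (drops combining marks).
        base = "".join(c for c in decomposed if ord(c) < 128)
        # Use the character only if it reduces to a single ASCII letter.
        if len(base) == 1 and base.isalpha():
            src.append(ch)
            dst.append(base)
    return "".join(src), "".join(dst)


from_chars, to_chars = build_translate_args()
# Paste the two strings into: SELECT translate($1, '<from_chars>', '<to_chars>');
print(from_chars)
print(to_chars)
```

Letters that have no decomposition (for example ø or ł) are skipped by this approach and would still have to be added manually.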
Upvotes: 3