Reputation: 12581
I have a large number of Scottish and Welsh accented place names (combining grave, acute, circumflex and diareses) which I need to update to their unicode normalized form, eg, the shorter form 00E1 (\xe1) for á
instead of 0061 + 0301 (\x61\x301)
I have found a solution from an old Postgres nabble mail list from 2009, using pl/python,
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;
This works, as expected, but made me wonder if there was any way of doing it directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
EDIT: As Craig has pointed out, and one of the things I tried:
SELECT convert_to(E'\u00E1', 'iso-8859-1');
returns \xe1
, whereas
SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
fails with the ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"
Upvotes: 20
Views: 6754
Reputation: 422
PostgreSQL 13 has introduced string function normalize ( text [, form ] ) → text
, which is available when the server encoding is UTF8.
> select 'päivää' = 'päivää' as without, normalize('päivää') = normalize('päivää') as with_norm ;
without | with_norm
---------+-----------
f | t
(1 row)
Note that I am expecting this to miss any indices, and therefore using this blindly in a hot production query is prone to be a recipe for disaster.
Great news for us who have naively stored NFD filenames from Mac users in our databases.
Upvotes: 4
Reputation: 325091
I think this is a Pg bug.
In my opinion, PostgreSQL should be normalizing utf-8 into pre-composed form before performing encoding conversions. The result of the conversions shown are wrong.
I'll raise it on pgsql-bugs ... done.
http://www.postgresql.org/message-id/[email protected]
You should be able to follow the thread there.
Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
BTW, this can be simplified down to:
regress=> SELECT 'á' = 'á';
?column?
----------
f
(1 row)
which is plain crazy-talk, but is permitted. The first is precomposed, the second is not. (To see this result you'll have to copy & paste, and it'll only work if your browser or terminal don't normalize utf-8).
If you're using Firefox you might not see the above correctly; Chrome renders it correctly. Here's what you should see if your browser handles decomposed Unicode correctly:
Upvotes: 15