darrensunley
darrensunley

Reputation: 49

Postgres - character sets and encodings

I was wondering if someone can help me understand what's going on/wrong with my Postgres data please...

I'll explain things below - but I guess ultimately the questions I have are :

I've got a database that has a table with data about flights (although obviously it could be anything really), defined as follows...

CREATE TABLE public.flight (
    flightid integer DEFAULT nextval('public.flight_seq'::regclass) NOT NULL,
    tripid integer NOT NULL,
    flightdatedeparted date NOT NULL,
    flightairportdeparted text NOT NULL,
    flightairportarrived text NOT NULL,
    flightairline text NOT NULL,
    flightdetails text,
    flightdayflightnumber integer DEFAULT 1 NOT NULL,
    flightdistance numeric
);

Now, when I enter data into it via a web front end connected to this database then I end up with data something like...

holidayinfo=# select distinct * from flight where flightid=97;
-[ RECORD 1 ]---------+---------------------------------------
flightid              | 97
tripid                | 36
flightdatedeparted    | 2004-05-14
flightairportdeparted | LHR
flightairportarrived  | WAW
flightairline         | British Airways
flightdetails         | Hotline, £82.40, BA850, 13:40 -> 17:05
flightdayflightnumber | 1
flightdistance        | 912.7

However, the data that I'd entered into the web form for the field "flightdetails" was actually...

Hotline, £82.40, BA850, 13:40 -> 17:05

Now, when I dump the data and look at it in Notepad++, depending on what encoding I use then sometimes I see it correctly as the pound symbol (when I choose ANSI) and other times it's incorrect as xA3 (when I choose UTF8).

At least when it's stored in Postgres as the "funny" value then it also displays correctly on my webpage when I retrieve the data - so that's good.

If I try to manually update the value via psql then I get the following...

holidayinfo=# update flight set flightdetails='Hotline, £82.40, BA850, 13:40 -> 17:05' where flightid=97;
ERROR:  invalid byte sequence for encoding "UTF8": 0x9c

In terms of how my database is created and what client encoding its using then I've got the following...

holidayinfo=# \l
                                                   List of databases
    Name     |  Owner   | Encoding |           Collate           |            Ctype            |   Access privileges
-------------+----------+----------+-----------------------------+-----------------------------+-----------------------
 holidayinfo | postgres | UTF8     | English_United Kingdom.1252 | English_United Kingdom.1252 |
 leagueinfo  | postgres | UTF8     | English_United Kingdom.1252 | English_United Kingdom.1252 |
 postgres    | postgres | UTF8     | English_United Kingdom.1252 | English_United Kingdom.1252 |
 template0   | postgres | UTF8     | English_United Kingdom.1252 | English_United Kingdom.1252 | =c/postgres          +
             |          |          |                             |                             | postgres=CTc/postgres
 template1   | postgres | UTF8     | English_United Kingdom.1252 | English_United Kingdom.1252 | =c/postgres          +
             |          |          |                             |                             | postgres=CTc/postgres
(5 rows)


holidayinfo=# show client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)

Maybe this is all working as designed, but I'm just confused as to how things should be?

Ultimately, I'd love to be able to have the data stored so that I can see it as the pound sign AND be entered/retrieved/displayed as the pound sign.

The former is desirable so that if ever I need to look at the data then I can see what the real data is - not have to make assumptions on what character "£" really means.

Also, this problem scales up when there are other characters having the same "issue" such as a hyphen (-) showing as "ÔÇô" and or an apostrophe (') showing as "ÔÇÖ".

Thanks in advance!

Upvotes: -1

Views: 3348

Answers (1)

Laurenz Albe
Laurenz Albe

Reputation: 248215

You must be viewing the data with psql using cmd.exe with code page CP-850.

The data in your database are wrong, because the application that inserted them had client_encoding set to WIN1252 while feeding the database UTF-8 characters.

So £, which is 0xC2A3 in UTF-8, is interpreted as two characters, namely  (0xC2) and £ (0xA3). They are converted to UTF-8 and stored in the database as 4 bytes (0xC382 and 0xC2A3). When you view them with psql, they are converted back to WINDOWS-1252, but cmd.exe interprets them as CP-850 and renders them as ┬ú.

The fix is to change client_encoding to UTF8 in the application that inserts the data into the database.

Upvotes: 5

Related Questions