urschrei
urschrei

Reputation: 26859

Postgres storing UTF-8 values incorrectly upon import

On Postgres 9.1 I'm importing data only (no structure) into a UTF-8 DB containing no records

psql -l gives me

  Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 glyph     | glyph    | UTF8     | en_GB.UTF-8 | en_GB.UTF-8 | =Tc/glyph            +
           |          |          |             |             | glyph=CTc/glyph

The table looks like so:

-- Table: sign

-- DROP TABLE sign;

CREATE TABLE sign
(
  id serial NOT NULL,
  sign_ref character varying(150) NOT NULL,
  CONSTRAINT pk_sign PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE sign
  OWNER TO glyph;

-- Index: ix_sign_sign_ref

-- DROP INDEX ix_sign_sign_ref;

CREATE UNIQUE INDEX ix_sign_sign_ref
  ON sign
  USING btree
  (sign_ref COLLATE pg_catalog."default");

Value insertion from my SQL file:
Command: psql -h localhost -d glyph -U glyph -f import.sql

START TRANSACTION;
SET standard_conforming_strings=off;
SET escape_string_warning=off;
SET CONSTRAINTS ALL DEFERRED;
INSERT INTO "sign" VALUES (29585,'DE₂');
COMMIT;

However, this has been converted into "DEâ" in the db. What have I done wrong?

EDIT:

select convert_to(
(select sign_ref from sign where id = 29585), 'utf8'); 

Gives me:
DE\303\242\302\202\302\202

Upvotes: 2

Views: 3302

Answers (1)

Daniel Vérité
Daniel Vérité

Reputation: 61546

The problem is that the input (the INSERT query) was encoded in UTF-8 but interpreted as LATIN1 (or LATIN9), probably because of a wrong client_encoding setting.

The line of reasoning is:

  1. The byte representation of DE₂ in utf-8 is 44 45 e2 82 82 (in hex)
  2. When interpreted in LATIN1, these bytes mean DEâ and two control characters 0x82. That's what enters the database.
  3. When re-reading this with as normal utf-8, what gets out is â encoded in utf-8 followed by two U+0082 encoded in utf-8. In hex bytes, that's 44 45 c3 a2 c2 82 c2 82. In octal, that's DE\303\242\302\202\302\202 which is precisely what you're getting with convert_to(sign_ref, 'utf-8').

Demo with a terminal in utf-8:

$ cat > test.sql
create table test(t text);
set client_encoding to latin1;  -- intentional mistake for the purpose of demoing
insert into test values('DE₂');

$ psql -d test -f test.sql

$ psql -d test
select * from test;
        t        
-----------------
 DEâ\u0082\u0082

set bytea_output='escape';

select convert_to(t, 'utf-8') from test;
         convert_to         
----------------------------
 DE\303\242\302\202\302\202

Upvotes: 3

Related Questions