Reputation: 26859
On Postgres 9.1 I'm importing data only (no structure) into a UTF-8 DB containing no records
psql -l
gives me
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
glyph | glyph | UTF8 | en_GB.UTF-8 | en_GB.UTF-8 | =Tc/glyph +
| | | | | glyph=CTc/glyph
The table looks like so:
-- Table: sign
-- DROP TABLE sign;
CREATE TABLE sign
(
id serial NOT NULL,
sign_ref character varying(150) NOT NULL,
CONSTRAINT pk_sign PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE sign
OWNER TO glyph;
-- Index: ix_sign_sign_ref
-- DROP INDEX ix_sign_sign_ref;
CREATE UNIQUE INDEX ix_sign_sign_ref
ON sign
USING btree
(sign_ref COLLATE pg_catalog."default");
Value insertion from my SQL file:
Command: psql -h localhost -d glyph -U glyph -f import.sql
START TRANSACTION;
SET standard_conforming_strings=off;
SET escape_string_warning=off;
SET CONSTRAINTS ALL DEFERRED;
INSERT INTO "sign" VALUES (29585,'DE₂');
COMMIT;
However, this has been converted into "DEâ" in the db. What have I done wrong?
EDIT:
select convert_to(
(select sign_ref from sign where id = 29585), 'utf8');
Gives me:
DE\303\242\302\202\302\202
Upvotes: 2
Views: 3302
Reputation: 61546
The problem is that the input (the INSERT query) was encoded in UTF-8
but interpreted as LATIN1
(or LATIN9
), probably because of a wrong client_encoding
setting.
The line of reasoning is:
DE₂
in utf-8 is 44 45 e2 82 82 (in hex)DEâ
and two control characters 0x82. That's what enters the database.â
encoded in utf-8 followed by two U+0082 encoded in utf-8. In hex bytes, that's 44 45 c3 a2 c2 82 c2 82. In octal, that's DE\303\242\302\202\302\202
which is precisely what you're getting with convert_to(sign_ref, 'utf-8')
. Demo with a terminal in utf-8:
$ cat > test.sql
create table test(t text);
set client_encoding to latin1; -- intentional mistake for the purpose of demoing
insert into test values('DE₂');
$ psql -d test -f test.sql
$ psql -d test
select * from test;
t
-----------------
DEâ\u0082\u0082
set bytea_output='escape';
select convert_to(t, 'utf-8') from test;
convert_to
----------------------------
DE\303\242\302\202\302\202
Upvotes: 3