Reputation: 3
I have a set of images exported from MSSQL to a CSV file. The file size is 1 GB. The data type in MSSQL is image. When I try to import the file into Postgres, where the column data type is bytea, this error occurs:
ERROR: invalid byte sequence for encoding "UTF8": 0xff
CONTEXT: COPY photo, line 1
When I look into the CSV file, the image data is in this format:
0xFFD8FFE000104A46494600010101006000600000FFE1...
My questions:
Solution that I tried:
http://pastebin.com/WrfjFqY6 This is a sample line from the CSV. It has 2 columns, id and photo.
Does anyone know how to solve this? Thanks in advance.
Upvotes: 0
Views: 570
Reputation: 324831
As yenyen notes in the comments, the issue was that the input was UCS-2 (probably really UTF-16) encoded.
UCS-2 is a two-byte-per-character encoding, so ordinary ASCII-range text in it is full of zero (null) bytes. If you tell PostgreSQL the file is UTF-8, it will see the input as garbage full of invalid UTF-8 sequences. If you tell PostgreSQL it's a simple 1-byte encoding like latin1, PostgreSQL will see the zero (null) bytes and reject the input, because it does not accept zero bytes in text, so it's clearly not latin1 after all.
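One way to act on this is to re-encode the file as UTF-8 before running COPY. The following is a minimal sketch, not a tested recipe for this exact file: it assumes the input really is UTF-16 with a BOM, and the function name, file names, and chunk size are hypothetical. It streams the file in chunks so a 1 GB input does not have to fit in memory.

```python
def transcode_to_utf8(src_path, dst_path, src_encoding="utf-16", chunk_chars=1 << 20):
    """Stream-convert a text file from src_encoding to UTF-8.

    Assumption: the source is UTF-16. Python's "utf-16" codec reads the
    BOM to pick the byte order and drops it from the decoded text.
    """
    # newline="" disables newline translation, so CRLF/LF are kept as-is.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # read up to chunk_chars characters
            if not chunk:
                break
            dst.write(chunk)


if __name__ == "__main__":
    import sys
    # Hypothetical usage: python transcode.py photo_utf16.csv photo_utf8.csv
    if len(sys.argv) == 3:
        transcode_to_utf8(sys.argv[1], sys.argv[2])
```

The converted file can then be loaded with COPY, with the encoding given as UTF8.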
The trick here is to examine the input file with an editor that can show the raw bytes, not just use a text editor that automagically reads the BOM and loads it as encoded text. If in doubt use a hex editor.
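If no hex editor is at hand, a few lines of Python can show the raw leading bytes and check them against the well-known BOMs. This is a rough sketch; the helper names `sniff_bom` and `peek` are made up for illustration, and `bytes.hex(" ")` needs Python 3.8 or later. A UTF-16 little-endian file starts with the bytes FF FE, which would be consistent with the 0xff reported at line 1 in the error above.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head: bytes):
    """Return a codec name if head starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def peek(path, n=32):
    """Return (hex dump of the first n bytes, BOM-based encoding guess)."""
    with open(path, "rb") as f:  # binary mode: no decoding, no BOM handling
        head = f.read(n)
    return head.hex(" "), sniff_bom(head)


if __name__ == "__main__":
    import sys
    # Hypothetical usage: python peek.py photo.csv
    if len(sys.argv) == 2:
        dump, guess = peek(sys.argv[1])
        print(dump)
        print("BOM suggests:", guess)
```

A `None` result only means there is no BOM; the file could still be UTF-16 without one, in which case alternating zero bytes in the hex dump are the giveaway.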
Upvotes: 1