Reputation: 15419
I am trying to store some text (e.g. č
) in a Postgres database, however when retrieving this value, it appears on screen as ?
. I'm not sure why it does this, I was under the impression that it was a character that wasn't supported in UTF-8, but was in UTF-8, however, judging by the first answer, this is an incorrect assumption.
Original question (which may still be valid):
I have read about UTF-8 Surrogate pairs, which may achieve what I require, and I've seen a few examples involving the
stringinfo
objectTextElementEnumerators
, but I couldn't work out a practical proof of concept.Can someone provide an example of how you would write and read UTF-16 (probably using this surrogate pair concept) to a postgres database. Thank you.
Updated question:
Why would the č
character be returned from the database as a question mark?
We use NPGSQL to access the database and VB.Net.
Upvotes: 6
Views: 14966
Reputation: 316
As to the problem storing/retrieving č
Check the character set the Postgre db is running on is UTF-8 character set (https://www.postgresql.org/docs/9.1/multibyte.html ) or a character set which can represent the character.
Check that the client connection to the database is set up to perform the appropriate codepage conversion ( for VB.Net this would be from UTF-16LE to UTF-8 or the database charset, this is typically a parameter on the connection string (charset) ).
Check that the input is the actual UTF-8 / UTF-16 in VB.net byte sequence, not the Windows-1250 byte sequence.
Check that this is not simply a limitation of the output tool or console (e.g. a Windows console typically does not display unicode characters but uses Windows-12xx character set (one can try https://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8) but typically inspecting the byte sequence in a VB.Net debugger is best.
Check that the length of the CHAR/VARCHAR column is sufficient to store your representation, even if represented in NFKD decomposition.
The grapheme you indicate has several different unicode representations.
U+010D LATIN SMALL LETTER C WITH CARON
U+0063 LATIN SMALL LETTER c followed by U+030C COMBINING CARON
And a different representations other character sets (e.g. 0xE8 in ISO-8859-2/Windows-1250 (https://en.wikipedia.org/wiki/Windows-1250) or ISO-8859-13 /Windows-1257.
All unicode representations fall into the basic multilingual plane, so the UTF-16 surrogate issue with postgre as indicated in the question title and answered below is likely irrelevant to your problem.
Upvotes: 0
Reputation: 316
Short answer, this is not directly possible as PostgreSQL only supports a UTF-8 character set.
UTF-16 based formats like Java, JavaScript, Windows can contain half surrogate pairs which have no representation in UTF-8 or UTF-32. These may easily be created by sub-stringing a Java, JavaScript, VB.Net string. As they cannot be represented in UTF-8 or UTF-32 and thus cannot be stored in a database which only supports an UTF-8 character set like PostgreSQL.
Windows Path names may contain half surrogate pairs which cannot be read as utf-8 ( https://github.com/rust-lang/rust/issues/12056 ).
One would have to use database system which supports a UTF-16/CESU-8 character set which is more adapted to Java/Android, JavaScript/NodeJS, .Net/wchar_t/Windows languages/platforms. (SQLServer, Oracle (UTF-8 collation), DB2, Informix, HANA, SQL Anywhere, MaxDB typically support such a charset.
Note that with emoticons being represented as unicode codepoints outside the Basic Multilingual Plane these differences will become more relevant also for western users.
On postgres you may: a) Accept the losses, b) Store the data as binary data or c) translate them to an encoded representation (e.g. the JSON rfc encodes them as two escaped characters to be able to transport half surrogates within an UTF-8/Ascii based network format without loss (https://www.rfc-editor.org/rfc/rfc4627 Section 2.5).
With e.g. emoticons being located outside the Basic multilingual plane this problem will become more relevant also in the western world.
Depending on the pick of language Application Server ( Java,Scala, C#/Windows, JavaScript/NodeJS) vs go and the level of investment into language support (using e.g. ICU string splitting functions at grapheme boundaries (https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) instead of simple truncation the issue may be less relevant. But the majority of enterprise systems and languages fall in the UTF-16 camp today, with software using a simple sub-string operations.
Upvotes: 1
Reputation: 1503110
There's no such thing as a character which exists in UTF-16 but not UTF-8. Both are capable of encoding all of Unicode. In other words, if you can get UTF-8 to work, it should be able to store any valid Unicode text.
EDIT: Surrogate pairs are actually a feature of UTF-16 rather than UTF-8. They allow a character which isn't in the basic multi-lingual plane (BMP) to be represented as two UTF-16 code units. Basically, UTF-16 is often treated as a fixed-width encoding (exactly two bytes per Unicode character) but that only allows the BMP to be encoded cleanly. Surrogate pairs are a (fairly hacky) way of extending the range beyond the BMP.
I very much doubt that the character you're trying to represent is outside the BMP, so I suspect you need to look elsewhere for the problem. In particular, it's worth dumping the exact character values of the text (e.g. by casting each char
to int
) before it goes into the database and after you've fetched it. Ideally, do this in a short but complete console app.
Upvotes: 11