Reputation: 49085
After reading Joel's article on Unicode, I still feel very unsure of my Unicode knowledge. Specifically, I'm left with this question:
Say I have a string with code points too large to fit in some encodings (e.g. ASCII), for example:
U+67CF U+1AAB U+ABCD U+7034
Then Joel says:
If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box.
But what does this string look like (at the binary/hex level) encoded in ASCII or some other encoding of insufficient size?
Upvotes: 0
Views: 158
Reputation: 521995
If you convert a string such as "ユニコード" to ASCII, there are no codes defined in ASCII that can represent any of these characters. What happens then is entirely up to the conversion software. Typically the software will replace each character it cannot encode with a "?", i.e. literally the ASCII question mark character (byte 0x3F). The result is then a regular ASCII string containing regular ASCII question mark characters.
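To see what that looks like at the byte level, here is a minimal sketch in Python. Python is only my choice for illustration; the `errors="replace"` handler is how its built-in ASCII codec does the "?" substitution, and other tools may behave differently.

```python
# Encode a non-ASCII string to ASCII, substituting "?" for every
# character that ASCII cannot represent.
text = "ユニコード"  # five katakana characters, none of which exist in ASCII

encoded = text.encode("ascii", errors="replace")

print(encoded)        # b'?????'
print(encoded.hex())  # 3f3f3f3f3f
```

Each original character becomes the single byte 0x3F, so the original code points cannot be recovered from the output.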
See "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" for a more in-depth follow-up to Joel's article.
Upvotes: 3
Reputation: 201518
The quoted statement does not make much sense. If an encoding has no code for a Unicode code point, then you simply cannot represent that code point in it. That’s it. You cannot represent “é” in ASCII, for example.
Perhaps the statement is meant to say that if you try to convert a string from one encoding to another and some character in the string does not have a representation in the target encoding, then you may see odd characters. Well, yes, but you could see anything else too. The conversion program could map “é” to “e”, or it could issue an error message and refuse to produce any output. Normally, the latter is the correct move.
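Both behaviours can be sketched in Python (my choice of language, not something the question specifies). The sample string and the variable names are made up for illustration, and the NFKD decomposition trick is just one common way to get “é” to “e”, not the only one.

```python
import unicodedata

text = "caf\u00e9"  # "café" with a precomposed "é" (U+00E9)

# Strict conversion: refuse to produce output when a character has no
# ASCII code. errors="strict" is the default for str.encode().
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Lossy mapping: decompose "é" into "e" plus a combining acute accent
# (NFKD), then drop the accent, which ASCII cannot represent.
decomposed = unicodedata.normalize("NFKD", text)
mapped = decomposed.encode("ascii", errors="ignore")

print(strict_failed)  # True
print(mapped)         # b'cafe'
```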
But there are situations where conversions are made on the fly and cannot get entangled in human interaction but must do something. At that point it is of course no longer character code conversion, but conversion in a broader sense. And it could apply many different strategies, like just dropping characters, or mapping them to representable characters or character combinations by some logic, or even changing the target encoding.
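A few of these strategies, sketched with the error handlers that Python's codecs happen to provide. The sample string and variable names are hypothetical, and other conversion tools will offer a different set of options.

```python
text = "\u00e9!"  # "é!" : one character ASCII lacks, one it has

# Strategy 1: just drop the unrepresentable character.
dropped = text.encode("ascii", errors="ignore")

# Strategy 2: map it to a combination of representable characters,
# here an XML/HTML numeric character reference or a backslash escape.
as_reference = text.encode("ascii", errors="xmlcharrefreplace")
as_escape = text.encode("ascii", errors="backslashreplace")

# Strategy 3: change the target encoding to one that can hold it.
switched = text.encode("utf-8")

print(dropped)       # b'!'
print(as_reference)  # b'&#233;!'
print(as_escape)     # b'\\xe9!'
print(switched)      # b'\xc3\xa9!'
```

Only the last two kinds of output let a reader get the original character back, and only if the reader knows which convention was used.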
Upvotes: 2