What does a string with large code points look like when encoded in an encoding scheme of insufficient capacity?

Question

After reading Joel's article on Unicode, I still feel very unsure of my Unicode knowledge. Specifically, I'm left with this question:

Say I have a string with code points too large to fit in some encodings (e.g. ASCII), for example:

U+67CF U+1AAB U+ABCD U+7034

Then Joel says:

If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box.

But what does this string look like (at the binary/hex level) encoded in ASCII or some other encoding of insufficient size?

deceze · Accepted Answer

If you convert a string such as "ユニコード" to ASCII, there are no codes defined in ASCII that can represent any of these characters. What happens then is entirely up to the conversion software. Typically the software replaces each character it cannot encode with a "?", i.e. literally the ASCII question mark character (byte 0x3F). The result is a regular ASCII string containing regular ASCII question mark characters, and the original characters cannot be recovered from it.
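As a rough illustration of the above, here is a minimal sketch that assumes Python 3 as the "conversion software" (the answer itself names no particular tool). It uses the built-in str.encode with its errors argument and the four code points from the question. Other tools may behave differently, for example by refusing to convert at all.

```python
# The four code points from the question, written as escapes.
text = "\u67cf\u1aab\uabcd\u7034"

# Default (strict) handling refuses to encode and raises instead.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="replace" substitutes one ASCII "?" per unencodable character.
replaced = text.encode("ascii", errors="replace")

print(strict_failed)    # True
print(replaced)         # b'????'
print(replaced.hex())   # 3f3f3f3f
```

So at the hex level, under this replacement policy, the encoded string is just four 0x3F bytes; nothing of the original code points survives in it.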

See "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" as a more in-depth follow-up to Joel's article.
