String that cannot be represented in UTF-8

Question

I am creating a set of tests for the size of a String. To do so, I am using something like myString.getBytes("UTF-8").length > MAX_SIZE, for which Java declares a checked exception, UnsupportedEncodingException.
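For reference, the size check described above might be sketched as follows. This is only an illustration: the class name Utf8Size, the helper methods, and the value of MAX_SIZE are assumptions made for the example. It uses the String.getBytes(Charset) overload with StandardCharsets.UTF_8, which does not declare UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Size {
    // Hypothetical limit in bytes, chosen only for this example.
    static final int MAX_SIZE = 16;

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        // The Charset overload throws no checked exception.
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the UTF-8 encoding of s is larger than MAX_SIZE bytes.
    static boolean exceedsLimit(String s) {
        return utf8Length(s) > MAX_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("abc"));          // 3 (ASCII: 1 byte each)
        System.out.println(utf8Length("\u20ac"));       // 3 (EURO SIGN, U+20AC)
        System.out.println(utf8Length("\uD83D\uDE00")); // 4 (U+1F600, a surrogate pair in Java)
        System.out.println(exceedsLimit("a fairly long test string"));
    }
}
```

Note that the byte length can differ from String.length(), which counts UTF-16 code units rather than bytes.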

Just out of curiosity, and to consider other possible test scenarios: is there any text that cannot be represented in the UTF-8 character encoding?

BTW: I did my homework, but nothing I can find states that UTF-8/Unicode does indeed include ALL possible characters. I know that its size is 2^32 and that many of those positions are still empty, but the question remains.

sstan · Accepted Answer

The official FAQ from the Unicode Consortium is pretty clear on the matter, and is a great source of information on all questions related to UTF-8, UTF-16, etc.

In particular, notice the following quote (emphasis mine):

Q: What is a UTF?

A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.

Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must map all code points (except surrogate code points) to unique byte sequences. This includes reserved (unassigned) code points and the 66 noncharacters (including U+FFFE and U+FFFF).

So, as you can see, by definition, all UTF encodings (including UTF-8) must be able to handle all Unicode code points. The only exception is the surrogate code points, which are reserved for UTF-16 surrogate pairs and are not real characters anyway.
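In Java specifically, a String is a sequence of UTF-16 code units, so it can contain an unpaired surrogate, and that is the one kind of String content that has no UTF-8 representation. The sketch below illustrates this with standard java.nio.charset calls; the class name SurrogateCheck and its helper methods are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class SurrogateCheck {
    // Hypothetical helper: true if s can be encoded as UTF-8 without loss.
    static boolean isEncodableAsUtf8(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    // Hypothetical helper: strict encoding that throws on malformed input
    // instead of silently substituting a replacement byte.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String lone = String.valueOf((char) 0xD800); // unpaired high surrogate
        String pair = "\uD83D\uDE00";                // valid surrogate pair (U+1F600)

        System.out.println(isEncodableAsUtf8(pair)); // true
        System.out.println(isEncodableAsUtf8(lone)); // false

        // String.getBytes does not fail; it substitutes '?' for the bad input.
        byte[] lossy = lone.getBytes(StandardCharsets.UTF_8);
        System.out.println(lossy.length + " byte(s), first = " + (char) lossy[0]);

        try {
            encodeStrict(lone);
        } catch (CharacterCodingException e) {
            System.out.println("strict encoding rejected the lone surrogate");
        }
    }
}
```

For a size test, this means getBytes will still return a length for such a String, but the bytes no longer round-trip back to the original text.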

Additionally, here is a quote directly from the Unicode Standard that also talks about this:

The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.

As you can see, the specified ranges cover the entire Unicode code space, assigned or not (excluding the surrogate range, of course).
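The two ranges quoted above are small enough to check exhaustively. The following is a rough sketch, not a conformance test: the class name Utf8RoundTrip and its methods are invented for the example, and it only exercises the UTF-8 implementation of whichever JDK runs it. It encodes every code point in U+0000..U+D7FF and U+E000..U+10FFFF and confirms that decoding returns the original string with the expected byte count.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTrip {
    // Number of UTF-8 bytes needed for one non-surrogate code point.
    static int expectedUtf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    // Counts code points that fail to round-trip or have an unexpected length.
    static int countFailures() {
        int failures = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // Skip U+D800..U+DFFF, the surrogate range excluded by the standard.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            String s = new String(Character.toChars(cp));
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            String back = new String(bytes, StandardCharsets.UTF_8);
            if (!s.equals(back) || bytes.length != expectedUtf8Length(cp)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failures: " + countFailures()); // failures: 0
    }
}
```

The loop covers 1,112,064 code points (0x110000 minus the 2,048 surrogates), including unassigned code points and noncharacters such as U+FFFE.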
