Understanding Unicode: Surrogate Blocks, Noncharacters

Question

I am trying to actually understand the Unicode standard and was poking through the XML spec, where it reads:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Now I have a couple of questions:

What are the surrogate blocks? Are they the UTF-16 codes that indicate a 4 byte code point?
Does #xXXXX refer to the code point or to the UTF-16 encoded value here?
If it refers to the code point and my understanding of the surrogate blocks is correct: Why are the surrogate blocks mentioned here? Isn't it the task of an encoding to hide those encoding-related details from the space the encoding maps from?
Why are non-characters like "U+FFFE" defined as part of the unicode standard? As to my understanding, Byte-order detection (as well as handling flexible sized code words) is up to the encoding.

Thanks for clarification!

Remy Lebeau · Accepted Answer

What are the surrogate blocks?

Unicode codepoints in the U+D800 to U+DFFF range, inclusive, which are reserved for exclusive use as UTF-16 surrogates and are illegal in any other context.

Are they the UTF-16 codes that indicate a 4 byte code point?

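Yes. A codepoint above U+FFFF is encoded in UTF-16 as a pair of 16-bit code units: a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF), for 4 bytes in total.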
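To make the pairing concrete, here is a minimal sketch of the arithmetic UTF-16 uses to split a codepoint above U+FFFF into two surrogates. The function name to_surrogate_pair is made up for illustration; it is not part of any library.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary codepoint (U+10000..U+10FFFF) into UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits    -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

Python's own UTF-16 encoder produces the same two code units for that codepoint, which is a quick way to check the arithmetic.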

Does #xXXXX refer to the code point or to the UTF-16 encoded value here?

The actual Unicode codepoints. The definition of Char includes values > #xFFFF, which a single 16-bit UTF-16 code unit cannot represent, so the values cannot be UTF-16 encoded values. UTFs are encoding schemes that map codepoint values to bytes. The XML spec is written in terms of codepoints, not encodings. An XML document can be encoded in any charset specified in the "encoding" attribute of the XML prolog, for purposes of storage and transmission, but the actual XML content is processed in terms of unencoded codepoints.
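As a rough illustration of the difference between a codepoint and its encoded forms, the sketch below prints the bytes that three UTFs produce for the same codepoint. The helper name encodings_of is hypothetical.

```python
def encodings_of(code_point: int) -> dict:
    """Return the hex bytes of one codepoint in three UTFs (illustrative helper)."""
    ch = chr(code_point)
    return {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
        "utf-32-be": ch.encode("utf-32-be").hex(),
    }


# One codepoint inside the BMP and one above U+FFFF.
for cp in (0x20AC, 0x1D11E):
    print(f"U+{cp:04X}", encodings_of(cp))
```

The codepoint is the same in every row; only the byte sequences differ. U+1D11E needs a surrogate pair (d834 dd1e) in UTF-16, but it is still a single Char as far as the XML grammar is concerned.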

If it refers to the code point and my understanding of the surrogate blocks is correct: Why are the surrogate blocks mentioned here?

The surrogate codepoints are reserved and not allowed to appear unencoded in any textual content. The Char definition is simply enforcing that rule.
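Strict encoders enforce the same rule. As a small sketch, Python's UTF-8 codec refuses to encode a lone surrogate, while a real supplementary codepoint goes through fine. The helper name can_encode is made up for this example.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether the codec accepts the text (illustrative helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("\ud800", "utf-8"))      # False: a lone surrogate is not text
print(can_encode("\U0001F600", "utf-8"))  # True: an ordinary supplementary codepoint
```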

Why are non-characters like "U+FFFE" defined as part of the unicode standard? As to my understanding, Byte-order detection (as well as handling flexible sized code words) is up to the encoding.

Because the encoding is not always known ahead of time, and may have to be detected dynamically. The BOM itself is U+FEFF, not U+FFFE. U+FFFE is what a decoder sees when it reads a UTF-16 BOM with the wrong byte order. Because U+FFFE is a noncharacter that is guaranteed never to be assigned, seeing it at the start of a stream reliably signals that the bytes need to be swapped. Early versions of Unicode allowed U+FEFF to be used as either a BOM or an actual zero-width non-breaking space character within textual content. That led to ambiguity at times. So newer versions of Unicode recommend U+FEFF for use as a BOM only, and U+2060 WORD JOINER is strongly preferred for the zero-width non-breaking behavior.

That being said, in the context of XML, it doesn't make sense to use U+FFFE in any textual content. It never represents text, and encountering it usually means the document was decoded with the wrong byte order. The entire document is encoded in a particular charset, and any BOM used has to appear before the XML prolog. The XML spec defines BOM handling and charset detection separately from the Char definition (U+FEFF itself falls inside [#xE000-#xFFFD] and is still a valid Char). So that is why the Char definition excludes U+FFFE.
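For reference, the BOM is the codepoint U+FEFF, and U+FFFE is what its two bytes look like when read in the wrong order. A minimal sketch of byte-order sniffing for UTF-16, assuming the data may start with a BOM (the function name sniff_utf16_byte_order is hypothetical; codecs.BOM_UTF16_BE and codecs.BOM_UTF16_LE are standard-library constants):

```python
import codecs


def sniff_utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading BOM (illustrative helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    return "unknown"


payload = "\ufeff<a/>".encode("utf-16-le")
print(sniff_utf16_byte_order(payload))  # utf-16-le

# Reading the little-endian BOM bytes as big-endian yields the noncharacter U+FFFE.
print(hex(ord(payload[:2].decode("utf-16-be"))))  # 0xfffe
```

This only covers UTF-16. Real XML processors also consider the UTF-8 BOM, the encoding declaration, and external information such as HTTP headers.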

U+FFFF is likewise a noncharacter: it is reserved, guaranteed never to be assigned to a real character, and not intended to ever be used in real content in practice. So that is why the Char definition excludes it.

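So basically the Char definition allows all Unicode codepoints minus the restricted ones: the surrogates, U+FFFE, U+FFFF, and the control characters below U+0020 other than tab, line feed, and carriage return.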
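The production quoted in the question translates almost directly into code. A minimal sketch (the name is_xml_char is made up; the ranges are copied from the Char production above):

```python
def is_xml_char(code_point: int) -> bool:
    """Check a codepoint against the XML 1.0 Char production (illustrative helper)."""
    return (
        code_point in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF         # stops just before the surrogates
        or 0xE000 <= code_point <= 0xFFFD       # skips U+FFFE and U+FFFF
        or 0x10000 <= code_point <= 0x10FFFF    # supplementary planes
    )


for cp in (0x41, 0xD800, 0xFEFF, 0xFFFE, 0xFFFF, 0x1F600):
    print(f"U+{cp:04X}", is_xml_char(cp))
```

Note that the check works on codepoints, so it has to run after the document bytes have been decoded, not on raw UTF-16 code units.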
