Reputation: 629
I am trying to actually understand the Unicode standard and was poking through the XML spec, where it reads:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
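Now I have a couple of questions:
1. What are the surrogate blocks? Are they the UTF-16 codes that indicate a 4 byte code point?
2. Does #xXXXX refer to the code point or to the UTF-16 encoded value here?
3. If it refers to the code point and my understanding of the surrogate blocks is correct: Why are the surrogate blocks mentioned here?
4. Why are non-characters like "U+FFFE" defined as part of the unicode standard? As to my understanding, Byte-order detection (as well as handling flexible sized code words) is up to the encoding.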
Thanks for clarification!
Upvotes: 5
Views: 3554
Reputation: 596853
What are the surrogate blocks?
Unicode codepoints in the U+D800 to U+DFFF range, inclusive, which are reserved for exclusive use as UTF-16 surrogates and are illegal in any other context.
Are they the UTF-16 codes that indicate a 4 byte code point?
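Yes. A high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF) encodes a single codepoint above U+FFFF as two 16-bit code units, 4 bytes in total.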
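As a rough illustration of how a codepoint above U+FFFF maps onto two surrogates, here is a minimal Python sketch. The function name `to_surrogate_pair` is made up for this example; the arithmetic is the standard UTF-16 split of the 20-bit offset into two 10-bit halves.

```python
def to_surrogate_pair(code_point):
    """Split a codepoint in U+10000..U+10FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # fits in 20 bits
    high = 0xD800 + (offset >> 10)       # top 10 bits    -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded.hex())  # d83dde00
```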
Does #xXXXX refer to the code point or to the UTF-16 encoded value here?
The actual Unicode codepoints. You can tell because the definition of Char includes values > #xFFFF, which an individual UTF-16 code unit cannot exceed. UTFs are byte encoding schemes for codepoint values. The XML spec is written in terms of codepoints, not encodings. An XML document can be encoded in any charset specified in the "encoding" attribute of the XML declaration, for purposes of storage and transmission, but the actual XML content is processed in terms of unencoded codepoints.
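To make the codepoint-versus-encoding distinction concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The two sample documents are invented for this example; they differ only in the declared encoding and in the bytes on disk, yet the parser hands back the same codepoint in both cases.

```python
import xml.etree.ElementTree as ET

# The same content (one element containing U+00E9) stored in two charsets.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><a>\u00e9</a>'.encode("iso-8859-1")
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><a>\u00e9</a>'.encode("utf-8")

# The byte sequences differ: U+00E9 is 1 byte in ISO-8859-1, 2 bytes in UTF-8.
assert latin1_doc != utf8_doc

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# After parsing, both are the same sequence of codepoints.
print([hex(ord(ch)) for ch in latin1_text])  # ['0xe9']
print(latin1_text == utf8_text)              # True
```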
If it refers to the code point and my understanding of the surrogate blocks is correct: Why are the surrogate blocks mentioned here?
The surrogate codepoints are reserved and not allowed to appear unencoded in any textual content. The Char definition is simply enforcing that rule.
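Python happens to enforce the same rule in its codecs, which gives a quick way to see it in action. In this sketch, `can_encode` is a hypothetical helper written for the example; it only reports whether the built-in encoder accepts the string.

```python
def can_encode(text, encoding):
    """Return True if text can be encoded, False if the codec rejects it."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


lone_surrogate = "\ud800"        # a surrogate codepoint on its own
supplementary = "\U0001F600"     # a real codepoint above U+FFFF

print(can_encode(lone_surrogate, "utf-8"))    # False
print(can_encode(lone_surrogate, "utf-16"))   # False
print(can_encode(supplementary, "utf-16"))    # True (encoded as a surrogate pair)
```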
Why are non-characters like "U+FFFE" defined as part of the unicode standard? As to my understanding, Byte-order detection (as well as handling flexible sized code words) is up to the encoding.
Because the encoding is not always known ahead of time, and may have to be detected dynamically. The BOM, U+FEFF, is used to help facilitate that. U+FFFE is what U+FEFF looks like when its two bytes are read in the wrong order, so Unicode defines U+FFFE as a non-character that must never represent text: a decoder that sees it at the start of a stream knows it has the byte order backwards. Early versions of Unicode allowed U+FEFF to be used as either a BOM or an actual zero-width non-breaking space character within textual content. That led to ambiguity at times. So newer versions of Unicode deprecate that second use, U+FEFF is meant to serve as a BOM only, and the non-breaking behavior is handled by U+2060 WORD JOINER instead to avoid any ambiguity.
That being said, in the context of XML, it doesn't make sense to use U+FFFE in any textual content. The entire document is encoded in a particular charset, and any BOM used would have to appear before the XML declaration. The XML spec defines BOM handling and charset detection outside of the XML document itself. So that is why the Char definition excludes U+FFFE.
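A minimal sketch of UTF-16 byte-order detection in Python, assuming the byte order mark is the codepoint U+FEFF and that U+FFFE is its byte-swapped form. `detect_utf16_byte_order` is a name made up for this example; `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are standard-library constants.

```python
import codecs


def detect_utf16_byte_order(data):
    """Guess the UTF-16 byte order from a leading byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return "unknown"


sample = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
print(detect_utf16_byte_order(sample))  # little-endian

# Reading the little-endian mark with the wrong byte order yields 0xFFFE,
# a value that never represents text, so the mismatch is unambiguous.
misread = int.from_bytes(codecs.BOM_UTF16_LE, "big")
print(hex(misread))  # 0xfffe
```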
U+FFFF is likewise reserved as a non-character and is not intended to ever be used in real content. So that is why the Char definition excludes it.
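So basically the Char definition allows all Unicode codepoints minus the restricted ones: the surrogates, U+FFFE, U+FFFF, and the control codepoints below U+0020 other than tab, line feed, and carriage return.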
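The Char production quoted in the question translates almost directly into a range check. This is only a sketch of that one production as written for XML 1.0; `is_xml_char` is a hypothetical name, not part of any library.

```python
def is_xml_char(code_point):
    """Return True if code_point matches the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


for cp in (0x41, 0xD800, 0xFFFE, 0xFFFF, 0x1F600):
    print(f"U+{cp:04X}", is_xml_char(cp))
# U+0041 True
# U+D800 False
# U+FFFE False
# U+FFFF False
# U+1F600 True
```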
Upvotes: 9