Reputation: 12509
In the W3C XML specification, I cannot find a definition of the set of characters permitted inside attribute values in XML documents.
Please quote the part of the specification which answers my question.
Upvotes: 2
Views: 5520
Reputation: 338386
XML attributes allow character data (a.k.a. CDATA). See the formal definition of attribute types, under "string type".
Fundamentally, one must distinguish between the XML source (i.e., as it would appear in a text editor) and the DOM (i.e., as it would exist in memory, after parsing the XML source).
Attributes can contain literal newlines (\n) in the XML source, like this:
<elem attr="a
linebreak">
but such newlines will be converted into a space during XML parsing. This is called attribute-value normalization.
In order to get a newline character after parsing, it must be encoded in the XML source, either as &#xA; or the equivalent, &#10;.
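The difference between a literal newline and an encoded one can be seen with a small sketch. It assumes Python's standard xml.etree.ElementTree module (backed by expat) as the parser; that choice is mine for illustration, and other conforming parsers should behave the same way.

```python
import xml.etree.ElementTree as ET

# A literal newline inside the attribute value, as typed in a text editor.
literal = ET.fromstring('<elem attr="a\nlinebreak"/>')

# The same newline written as a character reference (hex and decimal forms).
hex_ref = ET.fromstring('<elem attr="a&#xA;linebreak"/>')
dec_ref = ET.fromstring('<elem attr="a&#10;linebreak"/>')

print(repr(literal.get("attr")))  # normalized: the newline became a space
print(repr(hex_ref.get("attr")))  # the newline survives parsing
print(repr(dec_ref.get("attr")))  # the newline survives parsing
```

Only the two character-reference forms keep the newline; the literal one comes back as a single space.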
Normally the DOM API does that for you when you manipulate a document and save it. Unfortunately there are non-compliant APIs that do not correctly encode newlines in attribute values. These APIs make it impossible to retain newline characters.
The same thing occurs with the tab character (\t). It may exist in the XML source code, but it will be normalized into a single space upon parsing. To prevent that it must be encoded, either as &#x9; or &#9;.
Bottom line: if you interact with an XML document through an API (and you should!), all these details are being taken care of for you, unless of course the API is broken.
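As a rough check of whether a given API handles this for you, one can set an attribute value containing a newline and a tab, serialize, and parse the result again. The sketch below assumes a recent Python 3 and its standard xml.etree.ElementTree module; the attribute text is made up for the example.

```python
import xml.etree.ElementTree as ET

# Build the element through the API instead of writing XML source by hand.
elem = ET.Element("elem")
elem.set("attr", "line one\nline two\tafter a tab")

# Serialize to XML source, then parse that source again.
serialized = ET.tostring(elem, encoding="unicode")
reparsed = ET.fromstring(serialized)

print(serialized)
# True when the serializer encoded the newline and tab as character references.
print(reparsed.get("attr") == elem.get("attr"))
```

If the comparison prints False for some other library, that library is one of the non-compliant ones: it wrote the whitespace literally, and normalization turned it into spaces on the way back in.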
For the sake of completeness: Owing to a rather short-sighted (IMHO) decision, literal > characters are allowed inside attributes in the XML source code. Only literal < characters are forbidden:
<elem attr="this > that" /> <!-- legal syntax -->
<elem attr="this < that" /> <!-- syntax error -->
I'd recommend against using that quirk. Most APIs will insert the escaped form &gt; anyway:
<elem attr="this &gt; that" />
<elem attr="this &lt; that" />
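Both halves of that claim are easy to try out. This is a minimal sketch using Python's standard xml.etree.ElementTree module, which is my choice of parser for illustration; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A literal > in an attribute value is well-formed and parses fine.
ok = ET.fromstring('<elem attr="this > that" />')
print(ok.get("attr"))

# A literal < in an attribute value is a well-formedness error.
try:
    ET.fromstring('<elem attr="this < that" />')
    literal_lt_accepted = True
except ET.ParseError as err:
    literal_lt_accepted = False
    print("rejected:", err)

# On output, the serializer writes the escaped form rather than a literal >.
serialized = ET.tostring(ok, encoding="unicode")
print(serialized)
```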
Upvotes: 6
Reputation: 122414
http://www.w3.org/TR/xml/#NT-AttValue is the production you're looking for. Essentially, it says that an attribute value may contain any character except less-than, ampersand (except where it starts a valid character or entity reference), or the quote character used around the value. Single-quoted attributes can contain double quotes and double-quoted attributes can contain single quotes, but not vice versa.
As Tomalak states, newline characters are allowed, but they won't be reported as newlines by a parser.
Upvotes: 2
Reputation: 3915
Section 2.3 defines common syntactic constructs. In particular there is an AttValue rule:
AttValue ::= '"' ([^<&"] | Reference)* '"'
| "'" ([^<&'] | Reference)* "'"
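To get a feel for what this production accepts, it can be transcribed into a regular expression. The sketch below is an approximation, not the specification's own definition: the helper name is_att_value is hypothetical, the Name part of an entity reference is restricted to ASCII, and the check is purely syntactic (it does not verify that an entity is declared or that a character reference points to a legal character).

```python
import re

# Reference ::= EntityRef | CharRef
# Assumption: Name is approximated with ASCII characters only.
REFERENCE = r"&(?:[A-Za-z_:][A-Za-z0-9_.:\-]*|#[0-9]+|#x[0-9A-Fa-f]+);"

# AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"
ATT_VALUE = re.compile(
    rf'"(?:[^<&"]|{REFERENCE})*"'
    rf"|'(?:[^<&']|{REFERENCE})*'"
)


def is_att_value(literal: str) -> bool:
    """Return True if the quoted literal matches the approximated AttValue."""
    return ATT_VALUE.fullmatch(literal) is not None


print(is_att_value('"this > that"'))   # literal > is allowed
print(is_att_value('"this < that"'))   # literal < is not
print(is_att_value('"a & b"'))         # a bare ampersand is not
print(is_att_value('"a &amp; b"'))     # an entity reference is
print(is_att_value("'say \"hi\"'"))    # double quotes inside single quotes
```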
Upvotes: 4