Dušan Rychnovský

Reputation: 12509

Which characters are permitted in XML attributes?

In the W3C specification, I cannot find a definition of the set of characters that are permitted inside attribute values in XML documents.

  1. Is it the same as with the text content of elements?
  2. Or is it just a subset (excluding e.g. \n)?

Please quote the part of the specification which answers my question.

Upvotes: 2

Views: 5520

Answers (3)

Tomalak

Reputation: 338386

XML attributes allow character data (a.k.a. CDATA). See the formal definition of attribute types, under "string type".

Fundamentally, one must distinguish between the XML source (i.e., as it would appear in a text editor) and the DOM (i.e., as it would exist in memory, after parsing the XML source).

Attributes can contain literal newlines (\n) in the XML source, like this:

<elem attr="a
linebreak">

but each such newline will be converted into a space during XML parsing. This is called attribute-value normalization.

In order to get a newline character after parsing, it must be encoded in the XML source, either as &#xA; or the equivalent, &#10;.

Normally the DOM API does that for you when you manipulate a document and save it. Unfortunately there are non-compliant APIs that do not correctly encode newlines in attribute values. These APIs make it impossible to retain newline characters.

The same thing occurs with the tab character (\t). It may exist in the XML source code, but it will be normalized into a single space upon parsing. To prevent that it must be encoded, either as &#x9; or &#9;.
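
To make the normalization concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` parser. Python is my choice for illustration only; any conforming parser should behave the same way. It parses one element with a literal newline and tab in the attribute, and one with the same characters written as character references.

```python
import xml.etree.ElementTree as ET

# Literal newline and tab in the XML source: attribute-value
# normalization turns each of them into a single space.
literal = ET.fromstring('<elem attr="a\nline\tbreak"/>')

# The same characters written as character references survive parsing.
encoded = ET.fromstring('<elem attr="a&#xA;line&#x9;break"/>')

print(repr(literal.get("attr")))  # 'a line break'
print(repr(encoded.get("attr")))  # 'a\nline\tbreak'
```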

Bottom line: if you interact with an XML document through an API (and you should!), all these details are being taken care of for you, unless of course the API is broken.
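
As a rough illustration of an API taking care of this, the sketch below sets an attribute value containing a real newline through Python's `xml.etree.ElementTree` (chosen here only as an example API), serializes the element, and parses the result back. The newline should be written out as a character reference and therefore survive the round trip.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("elem")
elem.set("attr", "a\nlinebreak")  # real newline in the in-memory value

# Serializing encodes the newline as a character reference,
# so no literal newline ends up inside the attribute.
source = ET.tostring(elem, encoding="unicode")
reparsed = ET.fromstring(source)

print(source)
print(reparsed.get("attr") == "a\nlinebreak")
```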


For the sake of completeness: Owing to a rather short-sighted (IMHO) decision, literal > characters are allowed inside attributes in the XML source code. Only literal < are forbidden:

<elem attr="this > that" />  <!-- legal syntax -->
<elem attr="this < that" />  <!-- syntax error -->

I'd recommend against using that quirk. Most APIs will insert the escaped form &gt; anyway:

<elem attr="this &gt; that" />
<elem attr="this &lt; that" />

Upvotes: 6

Ian Roberts

Reputation: 122414

http://www.w3.org/TR/xml/#NT-AttValue is the production you're looking for. Essentially, it says that an attribute value may contain any character except less-than, ampersand (unless it is part of a valid character or entity reference), or the quote character used to delimit the value. So single-quoted attributes can contain double quotes and double-quoted attributes can contain single quotes, but neither can contain its own delimiter unescaped.

As Tomalak states, newline characters are allowed, but they won't be reported as newlines by a parser.

Upvotes: 2

PoByBolek

Reputation: 3915

Section 2.3 defines common syntactic constructs. In particular there is an AttValue rule:

AttValue       ::=      '"' ([^<&"] | Reference)* '"'
                     |  "'" ([^<&'] | Reference)* "'"
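
The rule can be transcribed into a regular expression to try out individual cases. This is only a sketch: `is_att_value` is a hypothetical helper, the name part of `Reference` is reduced to an ASCII-only approximation of the spec's `Name` production, and neither the `Char` production nor the well-formedness constraints (such as the entity being declared) are checked.

```python
import re

# Reference ::= EntityRef | CharRef
# Assumption: ASCII-only approximation of Name; the spec allows far more.
REFERENCE = r"&(?:[A-Za-z_:][A-Za-z0-9_.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);"

# AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"
ATT_VALUE = re.compile(
    rf'"(?:[^<&"]|{REFERENCE})*"'
    rf"|'(?:[^<&']|{REFERENCE})*'"
)

def is_att_value(literal: str) -> bool:
    """Return True if the quoted literal matches the AttValue grammar sketch."""
    return ATT_VALUE.fullmatch(literal) is not None

print(is_att_value('"this > that"'))   # True
print(is_att_value('"this < that"'))   # False
print(is_att_value('"a & b"'))         # False: bare ampersand
print(is_att_value('"a&#xA;b"'))       # True: character reference
```

Note that a literal newline matches `[^<&"]`, which is consistent with the answers above: it is allowed in the source, and only normalized later by the parser.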

Upvotes: 4
