Reputation: 3117
HTML 4 states pretty clearly which characters should be escaped:
Four character entity references deserve special mention since they are frequently used to escape special characters:
- "&lt;" represents the < sign.
- "&gt;" represents the > sign.
- "&amp;" represents the & sign.
- "&quot;" represents the " mark.
Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.
Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.
I'm surprised I can't find anything like this in HTML 5. With the help of grep the only non-XML mention I could find comes as an aside regarding the deprecated XMP element:
Use pre and code instead, and escape "<" and "&" characters as "&lt;" and "&amp;" respectively.
Could somebody point to the official source on this matter?
Upvotes: 27
Views: 61623
Reputation: 224886
The specification defines the syntax for normal elements as:
Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.
So you have to escape <, or & when followed by anything that could begin a character reference. The rule on ampersands is the only such rule for quoted attributes, as the matching quotation mark is the only thing that will terminate one. (Obviously, if you don’t want to terminate the attribute value there, escape the quotation mark.)
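As a rough illustration in Python (not something the specification prescribes), the standard library's html.escape covers these cases and deliberately escapes a bit more than the minimum, since it also rewrites > and, in quote mode, the single quote. The helper names text_node and attr_value are made up for this sketch.

```python
import html


def text_node(value: str) -> str:
    # For element content: escapes &, < and > (more than the strict minimum).
    return html.escape(value, quote=False)


def attr_value(value: str) -> str:
    # For a quoted attribute value: additionally escapes " and '.
    return html.escape(value, quote=True)


print(text_node('1 < 2 & "ok"'))
print('<a title="' + attr_value('a" onclick="x') + '">link</a>')
```

Escaping more than required is harmless here, because every character reference produced decodes back to the original character.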
These rules don’t apply to <script> and <style>; you should avoid putting dynamic content in those. (If you have to include JSON in a <script>, replace < with \x3c, the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialization.)
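A minimal Python sketch of that post-processing step follows. Two things here are assumptions rather than part of the answer: the helper name json_for_script is invented, and the sketch writes \u003c instead of \x3c. Both mean the same thing inside a JavaScript string literal, but \u003c is also a valid JSON escape, so the result can still be read by a strict JSON parser.

```python
import json


def json_for_script(value) -> str:
    # ensure_ascii=False leaves non-ASCII characters literal, so the two
    # line separators have to be replaced by hand below.
    text = json.dumps(value, ensure_ascii=False)
    # In serialized JSON these characters can only occur inside string
    # literals, so plain text replacement is enough.
    return (
        text.replace("<", "\\u003c")
        .replace("\u2028", "\\u2028")
        .replace("\u2029", "\\u2029")
    )


payload = {"html": "</script><b>hi</b>", "sep": "a\u2028b\u2029c"}
print(json_for_script(payload))
```

With the default ensure_ascii=True, json.dumps already emits the two separators as \u2028 and \u2029, but it never escapes <, so the first replacement is needed either way.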
Upvotes: 11
Reputation: 152966
From http://www.w3.org/html/wg/drafts/html/master/single-page.html#serializing-html-fragments
Escaping a string (for the purposes of the algorithm* above) consists of running the following steps:
- Replace any occurrence of the "&" character by the string "&amp;".
- Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
- If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
- If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".
*Algorithm is the built-in serialization algorithm as called e.g. by the innerHTML getter.
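The quoted steps translate almost line for line into code. This is only a sketch in Python; the function name escape_string and the attribute_mode parameter are my own labels, not names taken from the specification.

```python
def escape_string(text: str, attribute_mode: bool = False) -> str:
    # The ampersand goes first; otherwise the "&" of the references
    # produced by the later steps would be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        text = text.replace('"', "&quot;")
    else:
        text = text.replace("<", "&lt;").replace(">", "&gt;")
    return text


print(escape_string("a < b && c > d"))
print(escape_string('say "hi" <now>', attribute_mode=True))
```

Note that the two modes are not supersets of each other: attribute mode leaves < and > alone, and text mode leaves the double quote alone.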
Strictly speaking, this is not exactly an answer to your question, since it deals with serialization rather than parsing. But on the other hand, the serialized output is designed to be safely parsable. So, by implication, when writing markup:
- the & character should be replaced by &amp; (surprise!...)
- " should be escaped as &quot;
- < should be escaped as &lt; and > should be escaped as &gt;
I'm intentionally writing "should", not "must", since parsers may be able to correct violations of the above.
Upvotes: 6
Reputation: 51990
Adding my voice to insist that things are not that easy -- strictly speaking:
HTML serialization (the most common)
If you serialize your HTML5 as HTML, "the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand."
An ambiguous ampersand is an "ampersand followed by one or more alphanumeric ASCII characters, followed by a U+003B SEMICOLON character (;)"
So, in that case editable && copy (notice the spaces around &&) is a valid construction in HTML5 serialized as HTML, as none of the ampersands is followed by a letter.
As a counter example: editable&&copy; is not safe (even if this might work) as the last sequence &copy; might be interpreted as the entity reference for ©.
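To see the difference without opening a browser, Python's html.unescape can stand in for an HTML parser here, since it decodes character references according to the HTML5 rules. This is just a quick check, not a substitute for a real parser; the variable names are illustrative.

```python
import html

spaced = "editable && copy"
unspaced = "editable&&copy;"

# Neither ampersand starts a character reference, so nothing changes.
print(html.unescape(spaced))    # editable && copy

# The second ampersand starts "&copy;", which decodes to the copyright sign.
print(html.unescape(unspaced))  # editable&©
```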
XML serialization (the less common)
Here the classic XML rules apply. For example, each and every ampersand either in the text or in attributes should be escaped as &amp;.
In that case && (with or without spaces) is invalid XML. You should write &amp;&amp;
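The XML side is easy to verify with Python's standard library. A small sketch, where parses_as_xml is a helper made up for this example:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parses_as_xml(markup: str) -> bool:
    # True if the string is well-formed XML, False otherwise.
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


raw = "editable && copy"
bad = f"<p>{raw}</p>"           # bare ampersands: not well-formed
good = f"<p>{escape(raw)}</p>"  # escape() rewrites & as &amp;

print(parses_as_xml(bad), parses_as_xml(good))
print(ET.fromstring(good).text)
```

The same text that is acceptable in the HTML serialization is rejected outright by an XML parser, and it round-trips once every ampersand is escaped.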
Tricky, isn't it?
Upvotes: 3