I'm working on an HTML formatter (full source code here: https://github.com/kshetline/fortissimo-html). Although I've tried to read through the official HTML specs, I'm still having a hard time nailing down completely the rules for where whitespace is significant, and where it isn't.
There seem to be three categories of whitespace significance:
<pre> or <textarea> tags.
I'm clear on <pre> and <textarea> belonging to category 1.
I'm clear on formatting tags like <i> and <strong>, etc., and <span> as well, belonging to category 2.
I'm also clear on the content of tags like <p>, <td>, and <li> belonging to category 3.
Where it gets more confusing for me is in between tags like those above. It seems clear (but only by example) that whitespace between table elements doesn't matter (although non-whitespace, which doesn't properly belong there, does some weird things).
Whitespace does, however, oddly matter inside <ul> and <ol> tags (often to the annoyance of people using <ul>/<li> markup for menus and nav bars).
So, if I'm reformatting HTML either to make it prettier with nice indentation, or to condense it and remove as much whitespace as possible, what's a good guide to where I...
...and not change how a typical browser renders the final HTML?
(By the way, I'm aware that CSS can change the behavior of almost any tag to, say, act like a <pre> tag, but I'm only concerned with general rules where there are no CSS modifications, or very common cases like inlined <ul>/<li>.)
table {
  display: inline-table;
  border-spacing: 0;
}
td {
  padding: 0;
}
li {
  display: inline-block;
}
<body>
<span>One</span>
<span>Two</span>
<span>Three</span><span>Four</span>
<table>x
<tr>
<td> 1 </td> <td> 2 </td>
</tr>
<tr><td>3</td><td>4</td></tr>y
</table>z
<br>
<ul>
<li>Foo
<li> Bar</li>
<li>Baz</li>
</ul>
</body>
https://codepen.io/kshetline/pen/KKKKZVK
Upvotes: 3
Views: 620
For the purposes of this discussion, whitespace is as defined here: https://infra.spec.whatwg.org/#ascii-whitespace
ASCII whitespace is U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, or U+0020 SPACE.
Non-breaking whitespace and other Unicode space characters are handled very much like any other non-space character. (U+000B might also go along for the ride as whitespace, even though it's not listed above.)
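As a rough illustration of that definition (a sketch, not code from the formatter; the names `ASCII_WHITESPACE` and `isAsciiWhitespace` are made up here), a check for HTML's notion of whitespace needs to be narrower than JavaScript's `\s`, which also matches U+00A0 and U+000B:

```javascript
// The five code points the Infra standard lists as ASCII whitespace.
const ASCII_WHITESPACE = /^[\t\n\f\r ]$/;

// Hypothetical helper: true only for a single TAB, LF, FF, CR, or SPACE.
function isAsciiWhitespace(ch) {
  return ASCII_WHITESPACE.test(ch);
}

console.log(isAsciiWhitespace(' '));      // true
console.log(isAsciiWhitespace('\u00A0')); // false: no-break space is not ASCII whitespace
console.log(/\s/.test('\u00A0'));         // true: JavaScript's \s is broader
```

Using an explicit character class rather than `\s` keeps no-break spaces from being collapsed or trimmed by accident.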
So, from what I've determined so far, this is my best understanding now of HTML whitespace handling:
<textarea> is a special case where almost all of the whitespace inside is significant, because it directly affects the value assigned to the textarea as an input field. While the rendering of this element can be changed via CSS, whitespace remains significant to the value of the input regardless of rendering.
<pre>, of course, is specifically designed to make almost all whitespace significant in how text will be rendered.
<textarea> to have a single newline become part of the input value). From what I can tell, this special newline handling goes beyond what can be specified via CSS. Within any other element, single newlines alone take effect if you use CSS white-space: pre and related styling. (See example here: https://jsfiddle.net/kshetline/a510tq4v)
<script> and <style> tags are obviously special cases, with language-specific whitespace significance. Except for the content of quoted values, however, and the newline needed at the end of a JavaScript line comment (//), all whitespace can be either reduced to a single whitespace character, or omitted where not syntactically required.
Whitespace inside CDATA sections should be left as is, although it may or may not render depending on circumstances.
Whitespace is all significant inside quoted attribute values.
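The special newline handling for <pre> and <textarea> mentioned earlier can be sketched like this: the HTML parser ignores one newline that immediately follows the start tag, so a formatter should neither add nor remove a line break at that spot. `dropLeadingNewline` is a hypothetical name, and the sketch assumes it receives the raw source text between the start and end tags.

```javascript
// Sketch: mimic the parser ignoring a single newline right after
// <pre> or <textarea>. Accepts LF, CRLF, or a lone CR as that newline.
function dropLeadingNewline(rawContent) {
  return rawContent.replace(/^(\r\n|\n|\r)/, '');
}

console.log(JSON.stringify(dropLeadingNewline('\nabc')));   // "abc"
console.log(JSON.stringify(dropLeadingNewline('\n\nabc'))); // "\nabc"
console.log(JSON.stringify(dropLeadingNewline('abc\n')));   // "abc\n"
```

Only the first newline is dropped, which is why two are needed in the source for one to survive.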
Setting aside the above cases, or the application of the CSS white-space property, all contiguous sequences of whitespace will otherwise be treated the same way: as a single whitespace. Three spaces, twenty newlines, a dozen alternating form-feed and tab characters, it's all the same as a single space.
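A minimal sketch of that collapsing rule, assuming the text is not inside <pre> or <textarea> and no CSS overrides apply (`collapseWhitespace` is an illustrative name, not part of the formatter):

```javascript
// Sketch: collapse each run of ASCII whitespace to one space.
// U+00A0 and other Unicode spaces are deliberately left alone.
function collapseWhitespace(text) {
  return text.replace(/[\t\n\f\r ]+/g, ' ');
}

console.log(JSON.stringify(collapseWhitespace('a   b\n\n\nc\f\t\f\td'))); // "a b c d"

// Nothing to collapse here: one ordinary space between two no-break spaces.
console.log(collapseWhitespace('a\u00A0 \u00A0b') === 'a\u00A0 \u00A0b'); // true
```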
No whitespace is allowed between the < that starts a tag and the tag's name, e.g. <br> works, but < br> doesn't. Without belaboring the point, it's fairly obvious that extra spacing can prevent an end tag, a comment, etc., from parsing correctly as well.
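The start-tag case can be sketched as a simple check: an HTML tokenizer only begins a start tag when the < is followed immediately by an ASCII letter, and otherwise treats the < as ordinary text. `beginsStartTag` is a hypothetical helper that covers start tags only (not end tags or comments).

```javascript
// Sketch: does the text at `pos` begin a start tag?
// A "<" must be followed immediately by an ASCII letter.
function beginsStartTag(html, pos = 0) {
  return /^<[A-Za-z]/.test(html.slice(pos));
}

console.log(beginsStartTag('<br>'));  // true
console.log(beginsStartTag('<br >')); // true: whitespace after the tag name is fine
console.log(beginsStartTag('< br>')); // false: the "<" ends up as plain text
```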
Whitespace is needed syntactically to separate tag names from attribute names, and attribute names from other attribute names and attribute values where quoting doesn't make that separation clear.
Now here's the real meat of the subject that I was trying to get at for performing HTML reformatting without affecting how the HTML is typically rendered by most browsers:
<b> or <code> or <strong>, and the <span> element as well, are inline elements. Here's a list of inline elements, as it appears in the code I'm currently working on, obtained mostly from the HTML formatting settings of IntelliJ IDEA:
inline: new Set(['a', 'abbr', 'acronym', 'b', 'basefont', 'bdo', 'big', 'br', 'cite', 'code', 'dfn',
  'em', 'font', 'i', 'img', 'input', 'kbd', 'label', 'q', 's', 'samp', 'select', 'small', 'span',
  'strike', 'strong', 'sub', 'sup', 'text', 'tt', 'u', 'var']),
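One way such a set might be used, shown here only as a sketch: the set is abbreviated, both helper names are hypothetical, and `whitespaceMayRender` is a deliberate simplification of the rules in this answer that ignores CSS entirely.

```javascript
// Abbreviated copy of the inline-element set above, for illustration only.
const inlineTags = new Set(['a', 'b', 'code', 'em', 'i', 'img', 'span', 'strong']);

// HTML tag names are case-insensitive, so normalize before the lookup.
function isInlineTag(tagName) {
  return inlineTags.has(tagName.toLowerCase());
}

// Simplified rule: whitespace between two tags can only show up in the
// rendered page when both neighbors are inline; next to a block element
// it is trimmed.
function whitespaceMayRender(prevTag, nextTag) {
  return isInlineTag(prevTag) && isInlineTag(nextTag);
}

console.log(isInlineTag('SPAN'));                 // true
console.log(isInlineTag('div'));                  // false
console.log(whitespaceMayRender('span', 'span')); // true
console.log(whitespaceMayRender('td', 'td'));     // false
```

A formatter could treat a `false` result as permission to add or remove whitespace at that position, and a `true` result as a spot where whitespace may only be collapsed, never introduced or dropped.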
For text within inline elements, all interior contiguous whitespace is treated as a single space. Leading space is trimmed when it is at the start of a block element, or immediately follows a block element. Trailing space is trimmed when it is at the end of a block element, or immediately precedes a block element. For the purposes of rendering, a CDATA section is treated inline as far as neighboring text is concerned.
Consider:
<p> Hello,<i> world!</i> </p> It's Tuesday.
The space in front of "Hello" will not be rendered, because it's at the start of a block. The space in front of "world", however, will be rendered. The space after </i> will not be rendered, because it's at the end of a block. The space in front of "It's" will not be rendered, because it follows a block.
Now consider this:
<p>I'm <b> not </b> going to pay a lot for this muffler.</p>
The space after "I'm" and the space before "not" will together be treated as a single space. So will the space after "not" and the space before "going". Which spaces are kept, and which are thrown away? A little CSS styling, giving <b> a background color, reveals the leading spaces being discarded in favor of the trailing spaces.
This appears consistent using Chrome, Firefox, Safari, IE, and Edge.
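The behavior in the two examples above can be approximated with a small model. This is only a sketch under stated assumptions: the input is a flat list of text runs (one per stretch of text between tags), the hypothetical `BLOCK` marker stands for any block-level start or end tag, inline tags merely split the text into separate runs, and line wrapping and CSS are ignored. `renderRuns` is an illustrative name, not browser or formatter code.

```javascript
// Hypothetical marker for any block-level start or end tag.
const BLOCK = Symbol('block boundary');

// Returns the text runs as they would be rendered, one per input run.
function renderRuns(items) {
  const out = [];
  let lastText = -1;        // index in `out` of the latest non-empty run on this line
  let suppressSpace = true; // true at a block edge or right after a rendered space

  // A space left hanging before a block boundary is not rendered.
  const trimLineEnd = () => {
    if (lastText >= 0) out[lastText] = out[lastText].replace(/ $/, '');
    lastText = -1;
    suppressSpace = true;
  };

  for (const item of items) {
    if (item === BLOCK) {
      trimLineEnd();
      continue;
    }
    // Collapse runs of ASCII whitespace, then drop a leading space that
    // would duplicate the previous one or sit at the start of a block.
    let text = item.replace(/[\t\n\f\r ]+/g, ' ');
    if (suppressSpace) text = text.replace(/^ /, '');
    out.push(text);
    if (text !== '') {
      lastText = out.length - 1;
      suppressSpace = text.endsWith(' ');
    }
  }
  trimLineEnd();
  return out;
}

// <p> Hello,<i> world!</i> </p> It's Tuesday.
console.log(renderRuns([BLOCK, ' Hello,', ' world!', ' ', BLOCK, " It's Tuesday."]));
// [ 'Hello,', ' world!', '', "It's Tuesday." ]

// <p>I'm <b> not </b> going to pay a lot for this muffler.</p>
console.log(renderRuns([BLOCK, "I'm ", ' not ', ' going to pay a lot for this muffler.', BLOCK]));
// [ "I'm ", 'not ', 'going to pay a lot for this muffler.' ]
```

In the second result the run inside <b> comes out as "not " with its trailing space kept and its leading space gone, which matches what the background-color test shows.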
And now a really simple example:
<div> tree
bark </div>
None of the leading or trailing whitespace matters; it won't be rendered. No leading space, no trailing space, no additional line breaks. Just "tree bark".
Upvotes: 2