tmountain

Reputation: 179

Golang html.Parse rewriting href query strings to contain &amp;

I have the following code:

package main

import (
    "os"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    myHtmlDocument := `<!DOCTYPE html>
<html>
<head>
</head>
<body>
    <a href="http://www.example.com/input?foo=bar&baz=quux">WTF</a>
</body>
</html>`

    doc, _ := html.Parse(strings.NewReader(myHtmlDocument))
    html.Render(os.Stdout, doc)
}

The html.Render function is producing the following output:

<!DOCTYPE html><html><head>

</head>
<body>
    <a href="http://www.example.com/input?foo=bar&amp;baz=quux">WTF</a>

</body></html>

Why is it rewriting the query string and converting & to &amp; (in-between bar and baz)?

Is there a way to avoid this behavior?

I'm trying to do template transformation, and I don't want it mangling my URLs.

Upvotes: 1

Views: 756

Answers (1)

dave

Reputation: 64687

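The escaping happens in html.Render rather than html.Parse: the parsed tree holds the attribute value with a plain &, and the renderer encodes it on output. It does so because it wants to generate valid HTML, and the XHTML 1.0 guidelines state that an ampersand in an href attribute must be encoded: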

https://www.w3.org/TR/xhtml1/guidelines.html#C_12

In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., &reg; for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&amp;"). For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.

In this case, Go is actually making your HTML better and valid.

That said, browsers unescape the attribute value, so the URL that is followed when the link is clicked is still the correct one (with a plain &, not &amp;):

console.log(document.querySelector('a').href)
 <a href="http://www.example.com/input?foo=bar&amp;baz=quux">WTF</a>

EDIT: Since people are being pedantic in the comments, I'll note that in HTML5 you are no longer required to escape the ampersand, but it is still always valid to escape it. On the other hand, there are still situations in which it is invalid not to: essentially any time the ampersand is followed by one or more alphanumerics and a semicolon that do not form a named character reference:

An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more ASCII alphanumerics, followed by a U+003B SEMICOLON character (;), where these characters do not match any of the names given in the named character references section.

which means that a link like:

<a href="http://www.example.com/input?foo=bar&a;&baz=quux">WTF</a>

would be invalid, yet if it were

<a href="http://www.example.com/input?foo=bar&amp;a;&baz=quux">WTF</a>

it would be valid.

So the renderer sticks to one rule that is simpler to implement and works in all versions of HTML: always escape the ampersand. Your HTML stays valid either way.

Upvotes: 2
