J1989

Reputation: 11

HTML Purifier filters out underscore (_) from URL when inserting hyperlink

On a website, I am inserting text that contains a hyperlink, like so:

For more information <a href="https://example.com/sample_doc.html">read the docs</a>.

HTML Purifier strips the entire href attribute, so I'm not able to insert these URLs.

Output of HTML Purifier (from the demo website):

For more information <a>read the docs</a>.

Is there a way to configure HTML Purifier so that it allows underscores in my URLs?

I read the documentation of HTML Purifier but couldn't find an answer to my question.

My current config (default) looks like this:

{
  "Attr.AllowedFrameTargets": [
    "_blank"
  ],
  "Attr.EnableID": true,
  "HTML.AllowedComments": [
    "pagebreak"
  ],
  "HTML.SafeIframe": true,
  "URI.SafeIframeRegexp": "%^(https?:)?//(www.youtube.com/|player.vimeo.com/)%"
}

Upvotes: 0

Views: 30

Answers (2)

pinkgothic

Reputation: 6179

The underscore in your URL isn't the problem. The code snippet you're testing:

For more information <a href="https:/example.com/sample_doc.html">read the docs</a>.

...is missing a / after https:/. If you enter this:

For more information <a href="https://example.com/sample_doc.html">read the docs</a>.

...then HTML Purifier will leave it alone.
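To see why the single slash matters, here is a minimal sketch that splits both URLs with a generic URI parser. It uses Python's urllib.parse purely as a stand-in; it is not HTML Purifier's own parser, and the describe helper is a hypothetical name for illustration. The point is only that with https:/ there is no host component at all, and the would-be hostname ends up in the path.

```python
from urllib.parse import urlsplit


def describe(url: str) -> dict:
    """Split a URL into the parts that matter here (hypothetical helper)."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.hostname, "path": parts.path}


# Single slash after the scheme: no authority section, so no host is found.
broken = describe("https:/example.com/sample_doc.html")

# Double slash: the host is recognised and the underscore sits in the path.
fixed = describe("https://example.com/sample_doc.html")

print(broken)
print(fixed)
```

In the second case the underscore is in the path (/sample_doc.html), not in the hostname.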

The configuration you've toggled, Core.AllowHostnameUnderscore, allows URLs with an underscore in the hostname, like https://foo_bar.com/. From the documentation:

By RFC 1123, underscores are not permitted in host names. (This is in contrast to the specification for DNS, RFC 2181, which allows underscores.) However, most browsers do the right thing when faced with an underscore in the host name, and so some poorly written websites are written with the expectation this should work. Setting this parameter to true relaxes our allowed character check so that underscores are permitted.

You shouldn't actually need this. If you're likely to have a lot of input that contains https:/ as opposed to https:// in its URLs, consider doing some preprocessing instead to replace these faulty URLs.
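A rough sketch of what such preprocessing could look like, assuming the input is a plain string that gets cleaned before it is handed to HTML Purifier. The function name repair_single_slash_urls is made up for this example, and the pattern is deliberately simple: it rewrites every http:/ or https:/ that is not already followed by a second slash, anywhere in the text, not only inside href attributes. The same regular expression should port to other regex engines, but I have only written it out in Python here.

```python
import re

# http:/ or https:/ NOT followed by another slash (assumed to be a typo).
_SINGLE_SLASH = re.compile(r"\b(https?):/(?!/)", re.IGNORECASE)


def repair_single_slash_urls(html: str) -> str:
    """Turn 'https:/host' into 'https://host'; leave correct URLs untouched."""
    return _SINGLE_SLASH.sub(r"\1://", html)


sample = 'For more information <a href="https:/example.com/sample_doc.html">read the docs</a>.'
print(repair_single_slash_urls(sample))
```

Because the replacement runs on the whole string, it would also change a literal https:/ that appears in visible text, so narrow the pattern if that matters for your content.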

Upvotes: 0

J1989

Reputation: 11

I found the solution in the documentation of HTML Purifier: http://htmlpurifier.org/live/configdoc/plain.html#Core.AllowHostnameUnderscore

Adding the following line to the Default.json config file now allows me to use underscores in hostnames and URLs:

"Core.AllowHostnameUnderscore": true,

Upvotes: 1
