Reputation: 119164
I want to be able to accept HTML from untrusted users and sanitize it so that I can safely include it in pages on my website. By this I mean that markup should not be stripped or escaped, but should be passed through essentially unchanged unless it contains dangerous tags such as <script> or <iframe>, dangerous attributes such as onload, or dangerous CSS properties such as background URLs. (Apparently some older IEs will execute javascript URLs in CSS?)
Serving the content from a different domain, enclosed in an iframe, is not a good option because there is no way to tell in advance how tall the iframe has to be so it will always look ugly for some pages.
I looked into HTML Purifier, but it looks like it doesn't support HTML5 yet. I also looked into Google Caja, but I'm looking for a solution that doesn't use scripts.
Does anyone know of a library that will accomplish this? PHP is preferred, but beggars can't be choosers.
Upvotes: 5
Views: 2227
Reputation: 1099
I personally use HTML Purifier for this exact purpose.
It works well and lets you customize the rules down to every tag and attribute. So far I have had no security issues with this library.
Upvotes: 0
Reputation: 119164
I decided to just use html5lib-python. This is what I came up with:
#!/usr/bin/env python
import sys
from xml.dom.minidom import Node
import html5lib
from html5lib import (HTMLParser, sanitizer, serializer, treebuilders,
                      treewalkers)
parser = HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
                    tree=treebuilders.getTreeBuilder("dom"))
serializer = serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False)
document = parser.parse(sys.stdin.read(), encoding="utf-8")
# find the <html> node
for child in document.childNodes:
    if child.nodeType == Node.ELEMENT_NODE and child.nodeName == 'html':
        htmlNode = child
# find the <body> node
for child in htmlNode.childNodes:
    if child.nodeType == Node.ELEMENT_NODE and child.nodeName == 'body':
        bodyNode = child
# serialize all children of the <body> node
for child in bodyNode.childNodes:
    stream = treewalkers.getTreeWalker("dom")(child)
    sys.stdout.write(serializer.render(stream, encoding="utf-8"))
Example input:
<script>alert("hax")</script>
<p onload="alert('this is a dangerous attribute')"><b>hello,</b> world</p>
Example output:
&lt;script&gt;alert("hax")&lt;/script&gt;
<p><b>hello,</b> world</p>
Upvotes: 0
Reputation: 13157
See the WdHTMLParser class. I use this class for my forum.
This class parses HTML into an array:
<div>
    <span>
        <br />
        <span>
            un bout de texte
        </span>
        <input type="text" />
    </span>
</div>
Array:
Array (
    [0] => Array (
        [name] => div
        [args] => Array ()
        [children] => Array (
            [0] => Array (
                [name] => span
                [args] => Array ()
                [children] => Array (
                    [0] => Array (
                        [name] => br
                        [args] => Array ()
                    )
                    [1] => Array (
                        [name] => span
                        [args] => Array ()
                        [children] => Array (
                            [0] => un bout de texte
                        )
                    )
                    [2] => Array (
                        [name] => input
                        [args] => Array (
                            [type] => text
                        )
                    )
                )
            )
        )
    )
)
I use the Parser class below on my website to convert the array back to HTML.
voyageWdHTML_allowattr: these attributes will be allowed.
voyageWdHTML_allowtag: these tags will be allowed.
voyageWdHTML_special: make your own rules. Currently, I add target="_blank" to each link and replace <br> with a newline (\n) inside pre tags.
fix_javascript: you can enable or disable this function, but it should not be needed.
<?php
include "WdHTMLParser.php";
include "parser.php";
list($erreur, $message) = (new Parser())->parseBadHTML("<div>
    <span>
        <a onclick=\"alert('Hacked ! :'(');\">Check javascript</a>
        <script>alert(\"lol\");</script>
    </span>
</div>");
if ($erreur) {
    die("Error : ".$message);
}
echo $message;
Output:
<div>
    <span>
        <a target="_blank">Check javascript</a>
        <pre>alert("lol");</pre>
    </span>
</div>
<?php
class Parser {
    //private function fix_javascript(&$message) { }
    // Serialize the allowed attributes of a tag as ' name="value"' pairs.
    private function voyageWdHTML_args($tab_args, $objname) {
        $html = "";
        foreach ($tab_args as $attr => $valeur) {
            if ($valeur !== null && $this->voyageWdHTML_allowattr($attr)) {
                $html .= " $attr=\"".htmlentities($valeur)."\"";
            }
        }
        return $html;
    }
    // Attribute whitelist.
    private function voyageWdHTML_allowattr($attr) {
        return in_array($attr, array("align", "face", "size", "href", "title", "target", "src", "color", "style",
            "data-class", "data-format"));
    }
    // Tag whitelist.
    private function voyageWdHTML_allowtag($name) {
        return in_array($name, array("br", "b", "i", "u", "strike", "sub", "sup", "div", "ol", "ul", "li", "font", "span", "code",
            "hr", "blockquote", "cite", "a", "img", "p", "pre", "h6", "h5", "h4", "h3", "h2", "h1"));
    }
    // Custom rules applied to a node before it is rendered.
    private function voyageWdHTML_special(&$obj) {
        if ($obj["name"] == "a") { $obj["args"]["target"] = "_blank"; }
        if ($obj["name"] == "pre") {
            // Inside <pre>, keep text nodes, turn <br> into a newline and drop other tags.
            $children = array();
            foreach ($obj["children"] as $var) {
                if (is_string($var)) {
                    $children[] = $var;
                } else if ($var["name"] == "br") {
                    $children[] = "\n";
                }
            }
            $obj["children"] = $children;
        }
    }
    // Walk the parsed tree and rebuild the HTML; disallowed tags become <pre>.
    private function voyageWdHTML($tableau, $lvl = 0) {
        $html = "";
        foreach ($tableau as $obj) {
            if (is_array($obj)) {
                if (!$this->voyageWdHTML_allowtag($obj["name"])) {
                    $obj["name"] = "pre";
                    if (!isset($obj["children"])) {
                        $obj["children"] = array();
                    }
                }
                if (isset($obj["children"])) {
                    $this->voyageWdHTML_special($obj);
                    $html .= "<{$obj["name"]}{$this->voyageWdHTML_args($obj["args"], $obj["name"])}>{$this->voyageWdHTML($obj["children"], $lvl+1)}</{$obj["name"]}>";
                } else {
                    $html .= "<{$obj["name"]}>";
                }
            } else {
                $html .= $obj;
            }
        }
        return $html;
    }
    // Returns array($malformed, $htmlOrErrorMessage).
    public function parseBadHTML($message) {
        $WdHTMLParser = new WdHTMLParser();
        $message = str_replace(array("<br>", "<hr>"), array("<br/>", "<hr/>"), $message);
        $tableau = $WdHTMLParser->parse($message);
        if ($WdHTMLParser->malformed) {
            $retour = $WdHTMLParser->error;
        } else {
            $retour = $this->voyageWdHTML($tableau);
            //$this->fix_javascript($retour);// To make sure
        }
        return array($WdHTMLParser->malformed, $retour);
    }
}
<?php
class WdHTMLParser {
    private $encoding;
    private $matches;
    private $escaped;
    private $opened = array();
    public $malformed;
    public $error; // description of the last mismatched closing tag, if any
    public function parse($html, $namespace = NULL, $encoding = 'utf-8') {
        $this->malformed = false;
        $this->encoding = $encoding;
        $html = $this->escapeSpecials($html);
        // Split into text / closing-slash / tag-content triples.
        $this->matches = preg_split('#<(/?)' . $namespace . '([^>]*)>#', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
        $tree = $this->buildTree();
        if ($this->escaped) {
            $tree = $this->unescapeSpecials($tree);
        }
        return $tree;
    }
    // Hide comments and processing instructions from the tag splitter.
    private function escapeSpecials($html) {
        $html = preg_replace_callback('#<\!--.+-->#sU', array($this, 'escapeSpecials_callback'), $html);
        $html = preg_replace_callback('#<\?.+\?>#sU', array($this, 'escapeSpecials_callback'), $html);
        return $html;
    }
    private function escapeSpecials_callback($m) {
        $this->escaped = true;
        $text = $m[0];
        $text = str_replace(array('<', '>'), array("\x01", "\x02"), $text);
        return $text;
    }
    private function unescapeSpecials($tree) {
        return is_array($tree) ? array_map(array($this, 'unescapeSpecials'), $tree) : str_replace(array("\x01", "\x02"), array('<', '>'), $tree);
    }
    private function buildTree() {
        $nodes = array();
        $i = 0;
        $text = NULL;
        while (($value = array_shift($this->matches)) !== NULL) {
            switch ($i++ % 3) {
                case 0: {
                    // Text between tags.
                    if (trim($value)) {
                        $nodes[] = $value;
                    }
                }
                break;
                case 1: {
                    // "/" if the following tag is a closing tag.
                    $closing = ($value == '/');
                }
                break;
                case 2: {
                    if (substr($value, -1, 1) == '/') {
                        // Self-closing tag.
                        $nodes[] = $this->parseMarkup(substr($value, 0, -1));
                    } else if ($closing) {
                        $open = array_pop($this->opened);
                        if ($value != $open) {
                            $this->error($value, $open);
                        }
                        return $nodes;
                    } else {
                        $node = $this->parseMarkup($value);
                        $this->opened[] = $node['name'];
                        $node['children'] = $this->buildTree();
                        $nodes[] = $node;
                    }
                }
            }
        }
        return $nodes;
    }
    public function parseMarkup($markup) {
        preg_match('#^[^\s]+#', $markup, $matches);
        $name = $matches[0];
        preg_match_all('#\s+([^=]+)\s*=\s*"([^"]+)"#', $markup, $matches, PREG_SET_ORDER);
        $args = array();
        foreach ($matches as $m) {
            $args[$m[1]] = html_entity_decode($m[2], ENT_QUOTES, $this->encoding);
        }
        return array('name' => $name, 'args' => $args);
    }
    public function error($markup, $expected) {
        $this->malformed = true;
        // Store the message so that callers can read it from $parser->error.
        $this->error = sprintf('unexpected closing markup "%s", should be "%s"', $markup, $expected);
    }
}
<?php
class Parser {
    private function fix_javascript(&$message) {
        $js_array = array(
            "#(&\#(0*)106;?|&\#(0*)74;?|&\#x(0*)4a;?|&\#x(0*)6a;?|j)((&\#(0*)97;?|&\#(0*)65;?|a)(&\#(0*)118;?|&\#(0*)86;?|v)(&\#(0*)97;?|&\#(0*)65;?|a)(\s)?(&\#(0*)115;?|&\#(0*)83;?|s)(&\#(0*)99;?|&\#(0*)67;?|c)(&\#(0*)114;?|&\#(0*)82;?|r)(&\#(0*)105;?|&\#(0*)73;?|i)(&\#112;?|&\#(0*)80;?|p)(&\#(0*)116;?|&\#(0*)84;?|t)(&\#(0*)58;?|\:))#i",
            "#(o)(nmouseover\s?=)#i",
            "#(o)(nmouseout\s?=)#i",
            "#(o)(nmousedown\s?=)#i",
            "#(o)(nmousemove\s?=)#i",
            "#(o)(nmouseup\s?=)#i",
            "#(o)(nclick\s?=)#i",
            "#(o)(ndblclick\s?=)#i",
            "#(o)(nload\s?=)#i",
            "#(o)(nsubmit\s?=)#i",
            "#(o)(nblur\s?=)#i",
            "#(o)(nchange\s?=)#i",
            "#(o)(nfocus\s?=)#i",
            "#(o)(nselect\s?=)#i",
            "#(o)(nunload\s?=)#i",
            "#(o)(nkeypress\s?=)#i"
        );
        $message = preg_replace($js_array, "$1<b></b>$2$4", $message);
    }
}
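For readers who do not use PHP, the tree-walking idea above (render a parsed tree, keep only whitelisted tags and attributes, and turn everything else into a harmless pre element) can be sketched in Python. This is only an illustration under assumptions: the names render, ALLOWED_TAGS, ALLOWED_ATTRS and VOID_TAGS are hypothetical, the lists are shortened, and the input is assumed to be the same kind of nested name/args/children structure shown earlier.

```python
from html import escape

# Hypothetical, shortened whitelists for illustration only.
ALLOWED_TAGS = {"b", "i", "a", "p", "pre", "div", "span", "br"}
ALLOWED_ATTRS = {"title", "target"}
VOID_TAGS = {"br"}


def render(nodes):
    """Render text strings and {'name', 'args', 'children'} dicts to HTML."""
    out = []
    for node in nodes:
        if isinstance(node, str):
            # Text nodes are escaped so they can never open a tag.
            out.append(escape(node, quote=False))
            continue
        # Tags outside the whitelist are rendered as <pre>.
        name = node["name"] if node["name"] in ALLOWED_TAGS else "pre"
        attrs = "".join(
            ' {}="{}"'.format(key, escape(value))
            for key, value in node.get("args", {}).items()
            if key in ALLOWED_ATTRS
        )
        if name in VOID_TAGS:
            out.append("<{}{}>".format(name, attrs))
        else:
            children = render(node.get("children", []))
            out.append("<{0}{1}>{2}</{0}>".format(name, attrs, children))
    return "".join(out)


tree = [{
    "name": "div", "args": {}, "children": [
        {"name": "a", "args": {"onclick": "alert(1)", "title": "x"},
         "children": ["Check"]},
        {"name": "script", "args": {}, "children": ['alert("lol");']},
    ],
}]
print(render(tree))
```

The onclick attribute is dropped because it is not in the attribute whitelist, and the script element is rendered as pre, so its content is displayed instead of executed.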
Upvotes: 1
Reputation: 1592
In Ruby I use Nokogiri (there is also a PHP version) to parse HTML content. You can parse the user's input, remove unwanted tags or attributes, and then convert it back to text.
phpQuery is another parser.
PHP also has a strip_tags function.
Or you can manually remove unwanted attributes:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//*[@style]"); // all elements with a style attribute
foreach ($nodes as $node) {
    // remove or do what you want
    $node->removeAttribute("style");
}
echo $dom->saveHTML();
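The same DOM-plus-XPath idea can be sketched in Python with the standard library. This is a minimal sketch that assumes the input is a well-formed XHTML fragment with a single root element; the helper name strip_attribute is hypothetical. Python's xml modules are not hardened against maliciously constructed XML, so a stricter parser would be advisable for truly untrusted input.

```python
import xml.etree.ElementTree as ET


def strip_attribute(xhtml, attr):
    """Remove `attr` from every element of a well-formed XHTML fragment."""
    root = ET.fromstring(xhtml)
    # ".//*[@attr]" selects all descendants that carry the attribute, much
    # like the XPath query "//*[@style]"; the root is checked separately.
    matches = root.findall(".//*[@{}]".format(attr))
    if attr in root.attrib:
        matches.append(root)
    for element in matches:
        del element.attrib[attr]
    return ET.tostring(root, encoding="unicode")


print(strip_attribute(
    '<div style="color:red"><p style="margin:0">hi</p><b>ok</b></div>',
    "style",
))
```

Note that this removes one named attribute only; it is a blacklist of a single item, so every other attribute passes through unchanged.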
Upvotes: 2
Reputation: 1246
The blacklisting approach puts you under upgrade pressure: each time browsers start to support new standards, you MUST bring your sanitizing tool up to the same level. Such changes happen more often than you think.
Whitelisting (which you can achieve with strip_tags and a well-defined list of exceptions) of course shrinks the options for your users, but puts you on the safe side.
On my own sites, my policy is to apply blacklisting on pages for very trusted users (such as admins) and whitelisting on all other pages. That puts me in a position where I do not have to put much effort into the blacklisting. With more mature role and permission concepts you can even fine-tune your blacklists and whitelists.
UPDATE: I guess you are looking for this:
I now see that strip_tags whitelists at the tag level but accepts everything at the attribute level. Interestingly, HTML Purifier seems to do the whitelisting at the attribute level. Thanks, this was a nice lesson.
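To make the difference between tag-level and attribute-level whitelisting concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under assumptions, not a vetted sanitizer: the policy in ALLOWED, the scheme check in attribute_ok and the class name AllowlistSanitizer are all hypothetical, and comments and doctype declarations are silently dropped.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy: allowed tags mapped to their allowed attributes.
ALLOWED = {"p": set(), "b": set(), "i": set(), "a": {"href", "title"}}


def attribute_ok(tag, name, value):
    """Attribute-level whitelist, with a scheme check for links."""
    if name not in ALLOWED[tag]:
        return False
    if name == "href":
        return (value or "").lower().startswith(("http://", "https://"))
    return True


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            # Disallowed tags are shown as text instead of being interpreted.
            self.out.append(escape(self.get_starttag_text()))
            return
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or ""))
            for name, value in attrs
            if attribute_ok(tag, name, value)
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        closing = "</{}>".format(tag)
        self.out.append(closing if tag in ALLOWED else escape(closing))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = AllowlistSanitizer()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(sanitize('<p onload="alert(1)"><b>hello,</b> world</p>'))
print(sanitize('<a href="javascript:alert(1)" title="t">link</a>'))
```

A tag-only whitelist would have let the onload attribute and the javascript: link through, because both sit on allowed tags; the attribute check is what removes them.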
Upvotes: 6
Reputation: 218
Maybe it's better to take a different approach? How about telling them what they can use?
In that case you can use strip_tags. It will be easier and a lot more controllable this way, and very easy to extend in the future as well.
Upvotes: 2
Reputation: 4033
You might be able to do something along the lines of:
$html = preg_replace('/<\s*iframe\b[^>]*>.*?<\s*\/\s*iframe\s*>/is', '', $html);
$html = preg_replace('/<\s*script\b[^>]*>.*?<\s*\/\s*script\s*>/is', '', $html);
$html = preg_replace('/\s+onload\s*=\s*"[^"]*"/i', '', $html);
... but then again: you use regexes, and now you have two problems. This might remove more than you want and also leave in more than you want.
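A small Python sketch shows both failure modes. The pattern below is an assumption made for illustration (a double-quote-only onload filter in the spirit of the snippets above), and the helper name strip_onload is hypothetical.

```python
import re

# Illustrative pattern only: matches onload="..." with double quotes.
ONLOAD = re.compile(r'\s+onload\s*=\s*"[^"]*"', re.IGNORECASE)


def strip_onload(markup):
    """Remove double-quoted onload attributes using the naive pattern."""
    return ONLOAD.sub("", markup)


samples = [
    '<p onload="alert(1)">a</p>',          # removed as intended
    "<p onload='alert(1)'>b</p>",          # single quotes: left in
    "<p onload=alert(1)>c</p>",            # unquoted value: left in
    '<img src="x" onerror="alert(1)">',    # different handler: left in
    '<p>Set onload="init()" on body</p>',  # plain text: wrongly removed
]
for sample in samples:
    print(strip_onload(sample))
```

Only the first sample is handled as intended. Three dangerous variants survive, and the last sample loses ordinary text that merely mentions the attribute.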
But since HTML Purifier is probably the most modern and best-suited (and open source) project, you should still use that one and maybe make adjustments if you really need them.
You can check out one of the following as well:
You also have to make sure that including the results does not break your own page layout because of unclosed tags.
Upvotes: 2