Reputation: 11943
I have around 2 million HTML pages stored in S3. I'm trying to extract only the content from those pages while retaining the HTML structure under certain constraints. This HTML is all user-supplied input and should be considered unsafe. So for display purposes, I want to retain only some of the HTML tags, with constraints on attributes and attribute values, but still retain all of the properly encoded text content, even the text inside disallowed tags.
For example, I'd like to allow only specific tags like <p>, <h1>, <h2>, <h3>, <ul>, <ol>, <li>, etc. But I also want to keep whatever text is found between disallowed tags and maintain its structure. I also want to be able to restrict attributes in each tag or force certain attributes to be applied to specific tags.
For example, in the following HTML...
<div id="content">
Some text...
<p class="someclass">Hello <span style="color: purple;">PHP</span>!</p>
</div>
I'd like the result to be...
Some text...
<p>Hello PHP!</p>
Thus stripping out the unwanted <div> and <span> tags and the unwanted attributes of all tags, while still keeping the text inside <div> and <span>.
Simply using strip_tags() won't work here. So I tried doing the following with DOMDocument.
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach($dom->childNodes as $node) {
    if ($node->nodeName != "p") { // only allow paragraph tags
        $text = $node->nodeValue;
        $node->parentNode->nodeValue .= $text;
        $node->parentNode->removeChild($node);
    }
}
echo $dom->saveHTML();
This works on simple cases where there are no nested tags, but obviously fails when the HTML is complex.
I can't simply call this function recursively on each of the node's child nodes, because if I delete the node I lose all further nested children. Even if I defer node deletion until after the recursion, the order of text insertion becomes tricky: I go deep and return all valid nodes, then start concatenating the values of the invalid child nodes together, and the result is really messy.
For example, let's say I want to allow <p> and <em> in the following HTML
<p>Hello <strong>there <em>PHP</em>!</strong></p>
But I don't want to allow <strong>. If the <strong> has a nested <em>, my approach gets really confusing, because I'd get something like...
<p>Hello there !<em>PHP</em></p>
Which is obviously wrong. I realized that taking the entire nodeValue is a bad way of doing this, so I started digging into other ways to go through the entire tree one node at a time. I'm just finding it very difficult to generalize this solution so that it works sanely every time.
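To show why nodeValue loses the structure, here is a minimal sketch using the same example markup. It assumes only the stock DOM extension; the variable names are illustrative.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML(
    '<p>Hello <strong>there <em>PHP</em>!</strong></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

$strong = $dom->getElementsByTagName('strong')->item(0);

// nodeValue concatenates the text of all descendants, so the <em> is flattened away.
$flattened = $strong->nodeValue;
echo $flattened, PHP_EOL;        // there PHP!

// The structure is still there as three separate child nodes: text, <em>, text.
$childCount = $strong->childNodes->length;
echo $childCount, PHP_EOL;       // 3
```

Any approach that copies nodeValue upward therefore cannot keep the nested <em> in place; the child nodes themselves have to be moved.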
A solution that uses strip_tags(), or the answer provided here, isn't helpful for my use case: the former does not allow me to control the attributes, and the latter removes any tag that has attributes. I don't want to remove every tag that has an attribute. I want to explicitly allow certain tags but still have extensible control over which attributes can be kept or modified in the HTML.
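To make the attribute problem concrete, here is a minimal sketch. The onclick attribute is added purely for illustration and is not part of the original example. strip_tags() drops the disallowed <span> but leaves every attribute on the allowed <p> untouched.

```php
<?php
$html = '<p class="someclass" onclick="alert(1)">Hello <span style="color: purple;">PHP</span>!</p>';

// Only the tag names are checked; attributes on allowed tags pass through verbatim.
$result = strip_tags($html, '<p>');

echo $result, PHP_EOL;
// <p class="someclass" onclick="alert(1)">Hello PHP!</p>
```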
Upvotes: 0
Views: 3606
Reputation: 11943
It seems this problem needs to be broken down into two smaller steps in order to generalize the solution.
To get to a working solution, I found I first needed a sensible way to traverse every node in the DOM tree and inspect it to determine whether it should be kept as-is or modified.
So I wrote the following method as a simple generator in a class extending DOMDocument.
class HTMLFixer extends DOMDocument {
    public function walk(DOMNode $node, $skipParent = false) {
        if (!$skipParent) {
            yield $node;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $n) {
                yield from $this->walk($n);
            }
        }
    }
}
This way, doing something like foreach($dom->walk($dom) as $node) gives me a simple loop to traverse the entire tree. Of course this is a PHP 7 only solution because of the yield from syntax, but I'm OK with that.
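As a quick illustration of the traversal order, here is a minimal sketch that uses a standalone function instead of the class method. The name walkNodes and the sample markup are only assumptions for the demo.

```php
<?php
// Standalone equivalent of walk(): yields a node, then its descendants, depth first.
function walkNodes(DOMNode $node): Generator {
    yield $node;
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            yield from walkNodes($child);
        }
    }
}

$dom = new DOMDocument;
$dom->loadHTML(
    '<p>Hello <strong>there <em>PHP</em>!</strong></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

$names = [];
foreach (walkNodes($dom) as $node) {
    $names[] = $node->nodeName;
}

$order = implode(' ', $names);
echo $order, PHP_EOL;
// #document p #text strong #text em #text #text
```

Note that yield from reuses the inner generators' keys, so collecting the nodes with iterator_to_array() needs its second argument set to false; a plain foreach as above avoids the issue.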
The tricky part was figuring out how to keep the text and not the tag while making modifications inside the loop. After struggling with a few different approaches, I found the simplest way was to build a list of tags to be removed from inside the loop and then deal with them afterwards: first use DOMNode::insertBefore() to move their child nodes up the tree, then remove the emptied nodes. That way removing those nodes later has no side effects.
So I added another generalized stripTags method to this child class of DOMDocument.
public function stripTags(DOMNode $node) {
    $change = $remove = [];

    /* Walk the entire tree to build a list of things that need removed */
    foreach($this->walk($node) as $n) {
        if ($n instanceof DOMText || $n instanceof DOMDocument) {
            continue;
        }
        $this->stripAttributes($n); // strips all node attributes not allowed
        $this->forceAttributes($n); // forces any required attributes
        if (!in_array($n->nodeName, $this->allowedTags, true)) {
            // track the disallowed node for removal
            $remove[] = $n;
            // we take all of its child nodes for modification later
            foreach($n->childNodes as $child) {
                $change[] = [$child, $n];
            }
        }
    }

    /* Go through the list of changes first so we don't break the
       referential integrity of the tree */
    foreach($change as list($a, $b)) {
        $b->parentNode->insertBefore($a, $b);
    }

    /* Now we can safely remove the old nodes */
    foreach($remove as $a) {
        if ($a->parentNode) {
            $a->parentNode->removeChild($a);
        }
    }
}
The trick here is that insertBefore moves a node rather than copying it: the child nodes (e.g. text nodes) of a disallowed tag are detached from it and reinserted in front of it, under its parent. Doing that while still walking the tree could easily break the traversal. This confused me a lot at first, but looking at the way the method works, it makes sense. Deferring the moves until after the walk makes sure we don't break a parentNode reference, for example when a deeper node is allowed but its parent is not in the allowed tags list.
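The move-then-remove idea can be seen in isolation by unwrapping a single element. This is only a minimal sketch with illustrative variable names, not the full stripTags method.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML(
    '<p>Hello <strong>there <em>PHP</em>!</strong></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

$strong = $dom->getElementsByTagName('strong')->item(0);

// childNodes is a live list, so take a snapshot before moving anything.
$children = iterator_to_array($strong->childNodes);

// insertBefore() moves each child out of <strong> and places it in front of it.
foreach ($children as $child) {
    $strong->parentNode->insertBefore($child, $strong);
}

// <strong> is now empty and can be removed without losing content.
$strong->parentNode->removeChild($strong);

$result = $dom->saveHTML();
echo $result;
// <p>Hello there <em>PHP</em>!</p>
```

Because the children are inserted in their original order directly before the old element, the text and the nested <em> keep their relative positions.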
Here's the complete solution I came up with to more generally solve this problem. I'll include in my answer since I struggled to find a lot of the edge cases in doing this with DOMDocument elsewhere. It allows you to specify which tags to allow, and all other tags are removed. It also allows you to specify which attributes are allowed and all other attributes can be removed (even forcing certain attributes on certain tags).
class HTMLFixer extends DOMDocument {
    protected static $defaultAllowedTags = [
        'p',
        'h1',
        'h2',
        'h3',
        'h4',
        'h5',
        'h6',
        'pre',
        'code',
        'blockquote',
        'q',
        'strong',
        'em',
        'del',
        'img',
        'a',
        'table',
        'thead',
        'tbody',
        'tfoot',
        'tr',
        'th',
        'td',
        'ul',
        'ol',
        'li',
    ];
    protected static $defaultAllowedAttributes = [
        'a' => ['href'],
        'img' => ['src'],
        'pre' => ['class'],
    ];
    protected static $defaultForceAttributes = [
        'a' => ['target' => '_blank'],
    ];

    protected $allowedTags = [];
    protected $allowedAttributes = [];
    protected $forceAttributes = [];

    public function __construct($version = null, $encoding = null, $allowedTags = [],
                                $allowedAttributes = [], $forceAttributes = []) {
        $this->setAllowedTags($allowedTags ?: static::$defaultAllowedTags);
        $this->setAllowedAttributes($allowedAttributes ?: static::$defaultAllowedAttributes);
        $this->setForceAttributes($forceAttributes ?: static::$defaultForceAttributes);
        // Fall back to DOMDocument's own defaults instead of passing null through
        parent::__construct($version ?? '1.0', $encoding ?? '');
    }

    public function setAllowedTags(Array $tags) {
        $this->allowedTags = $tags;
    }

    public function setAllowedAttributes(Array $attributes) {
        $this->allowedAttributes = $attributes;
    }

    public function setForceAttributes(Array $attributes) {
        $this->forceAttributes = $attributes;
    }

    public function getAllowedTags() {
        return $this->allowedTags;
    }

    public function getAllowedAttributes() {
        return $this->allowedAttributes;
    }

    public function getForceAttributes() {
        return $this->forceAttributes;
    }

    public function saveHTML(DOMNode $node = null) {
        if (!$node) {
            $node = $this;
        }
        $this->stripTags($node);
        return parent::saveHTML($node);
    }

    protected function stripTags(DOMNode $node) {
        $change = $remove = [];

        // Walk the tree and collect disallowed nodes and their children
        foreach($this->walk($node) as $n) {
            if ($n instanceof DOMText || $n instanceof DOMDocument) {
                continue;
            }
            $this->stripAttributes($n);
            $this->forceAttributes($n);
            if (!in_array($n->nodeName, $this->allowedTags, true)) {
                $remove[] = $n;
                foreach($n->childNodes as $child) {
                    $change[] = [$child, $n];
                }
            }
        }

        // Move the children up first, then remove the emptied nodes
        foreach($change as list($a, $b)) {
            $b->parentNode->insertBefore($a, $b);
        }

        foreach($remove as $a) {
            if ($a->parentNode) {
                $a->parentNode->removeChild($a);
            }
        }
    }

    protected function stripAttributes(DOMNode $node) {
        // Nodes such as comments carry no attribute map; nothing to strip
        if (!$node->hasAttributes()) {
            return;
        }
        $attributes = $node->attributes;
        $len = $attributes->length;

        // Iterate backwards because removing an attribute shrinks the live map
        for ($i = $len - 1; $i >= 0; $i--) {
            $attr = $attributes->item($i);
            if (!isset($this->allowedAttributes[$node->nodeName]) ||
                !in_array($attr->name, $this->allowedAttributes[$node->nodeName], true)) {
                $node->removeAttributeNode($attr);
            }
        }
    }

    protected function forceAttributes(DOMNode $node) {
        if (isset($this->forceAttributes[$node->nodeName])) {
            foreach ($this->forceAttributes[$node->nodeName] as $attribute => $value) {
                $node->setAttribute($attribute, $value);
            }
        }
    }

    protected function walk(DOMNode $node, $skipParent = false) {
        if (!$skipParent) {
            yield $node;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $n) {
                yield from $this->walk($n);
            }
        }
    }
}
So if we have the following HTML
<div id="content">
Some text...
<p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
</div>
And we only want to allow <p> and <em>.
$html = <<<'HTML'
<div id="content">
Some text...
<p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
</div>
HTML;
$dom = new HTMLFixer(null, null, ['p', 'em']);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
echo $dom->saveHTML($dom);
We'd get something like this...
Some text... <p>Hello P<em>H</em>P!</p>
Since saveHTML() also accepts a specific node, you can limit this to a subtree of the DOM, so the solution could be generalized even more.
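One thing the class above does not cover is constraining attribute values: allowing href by name still lets a javascript: URL through. A possible extension, sketched here under my own assumptions, is a prefix allowlist for URL attributes. The function name hasAllowedScheme and the chosen schemes are hypothetical and not part of HTMLFixer; relative URLs are deliberately rejected to keep the check conservative.

```php
<?php
// Hypothetical helper: accept only values that start with an explicitly allowed scheme.
function hasAllowedScheme(string $url): bool {
    return preg_match('#^(?:https?://|mailto:)#i', $url) === 1;
}

$dom = new DOMDocument;
$dom->loadHTML(
    '<p><a href="javascript:alert(1)">bad</a> <a href="https://example.com/">good</a></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Drop href entirely when its value does not pass the allowlist.
foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href') && !hasAllowedScheme($a->getAttribute('href'))) {
        $a->removeAttribute('href');
    }
}

$result = $dom->saveHTML();
echo $result;
```

The same check could be called from stripAttributes for whichever attributes are known to hold URLs, such as href and src.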
Upvotes: 2
Reputation: 16086
You can use strip_tags() like this:
$html = '<div id="content">
Some text...
<p class="someclass">Hello <span style="color: purple;">PHP</span>!</p>
</div>';
$updatedHTML = strip_tags($html, "<p><h1><h2><h3><ul><ol><li>");
// The second parameter lists the HTML tags to retain.
You can get more information here: http://php.net/manual/en/function.strip-tags.php
Upvotes: 0