How to remove unwanted HTML tags from user input but keep text inside the tags in PHP using DOMDocument

Question

I have roughly 2 million HTML pages stored in S3. I'm trying to extract only the content from those pages, but I want to retain the HTML structure under certain constraints. This HTML is all user-supplied input and should be considered unsafe. So for display purposes, I want to retain only some of the HTML tags, with constraints on attributes and attribute values, while still keeping all of the properly encoded text content, even text inside disallowed tags.

For example, I'd like to allow only specific tags like

,

, etc.. But I also want to keep whatever text is found between disallowed tags and maintain its structure. I also want to be able to restrict attributes in each tag or force certain attributes to be applied to specific tags.

For example, in the following HTML...
```
  Some text...
  Hello PHP!
```
I'd like the result to be...
```
  Some text...
  Hello PHP!
```
Thus stripping out the unwanted
and tags, the unwanted attributes of all tags, and still maintaining the text inside
and .

Simply using strip_tags() won't work here. So I tried doing the following with DOMDocument.
```
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach($dom->childNodes as $node) {
    if ($node->nodeName != "p") { // only allow paragraph tags
        $text = $node->nodeValue;
        $node->parentNode->nodeValue .= $text;
        $node->parentNode->removeChild($node);
    }
}

echo $dom->saveHTML();
```
This works for simple cases where there are no nested tags, but obviously fails when the HTML is complex.

I can't simply call this function recursively on each node's children, because if I delete a node I lose all of its nested children. Even if I defer node deletion until after the recursion, the order of text insertion becomes tricky: I go deep and return all valid nodes, then start concatenating the values of the invalid child nodes together, and the result is really messy.

For example, let's say I want to allow
and in the following HTML

Hello there PHP!

But I don't want to allow . If the has nested my approach gets really confusing. Because I'd get something like ...

Hello there !PHP

Which is obviously wrong. I realized that taking the entire nodeValue is a bad way of doing this, so I started looking into other ways to go through the tree one node at a time. I'm just finding it very difficult to generalize this solution so that it works sanely every time.

Update

Using strip_tags() or the answer provided here doesn't help with my use case, because the former does not let me control the attributes and the latter removes any tag that has attributes. I don't want to remove every tag that has an attribute. I want to explicitly allow certain tags while keeping extensible control over which attributes can be kept or modified in the HTML.
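To make the strip_tags() limitation concrete, here is a minimal sketch (the input string is made up for illustration). The allowed tag survives with every attribute it arrived with, including an event handler:

```php
<?php
// Made-up input: an allowed <p> carrying an unwanted onclick attribute.
$input = '<p onclick="steal()">Hi <b>there</b></p>';

// strip_tags() drops <b> but leaves the attributes of the allowed <p> untouched.
$output = strip_tags($input, '<p>');

echo $output, "\n"; // <p onclick="steal()">Hi there</p>
```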

Sherif · Accepted Answer

It seems this problem needs to be broken down into two smaller steps in order to generalize the solution.

First, Walking the DOM Tree

In order to get to a working solution I found I need to have a sensible way to traverse every node in the DOM tree and inspect it in order to determine if it should be kept as-is or modified.

So I wrote the following method as a simple generator in a class extending DOMDocument.

```
class HTMLFixer extends DOMDocument {
    public function walk(DOMNode $node, $skipParent = false) {
        if (!$skipParent) {
            yield $node;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $n) {
                yield from $this->walk($n);
            }
        }
    }
}
```

This way, something like foreach ($dom->walk($dom) as $node) gives me a simple loop over the entire tree. Of course, this requires PHP 7 or later because of the yield from syntax, but I'm OK with that.
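As a quick illustration of the traversal order, here is a minimal standalone sketch of the same generator written as a plain function (walkNodes is a hypothetical name, and the sample markup is made up). It yields nodes in document order, parents before children:

```php
<?php
// Hypothetical standalone version of walk(): depth-first, parent before children.
function walkNodes(DOMNode $node): Generator {
    yield $node;
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            yield from walkNodes($child);
        }
    }
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<div><p>Hello <em>PHP</em>!</p></div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Collect the node names in the order the generator yields them.
$names = [];
foreach (walkNodes($dom) as $node) {
    $names[] = $node->nodeName;
}

echo implode(' ', $names), "\n"; // #document div p #text em #text #text
```

One thing to watch for: yield from keeps the keys of the inner generator, so keys repeat across recursion levels. A foreach over the values is unaffected, but iterator_to_array() needs false as its second argument to avoid overwriting entries.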

Second, Removing Tags but Keeping their Text

The tricky part was figuring out how to keep the text and not the tag while making modifications inside the loop. After struggling with a few different approaches, I found the simplest way was to collect the nodes to be removed while looping, then afterwards use DOMNode::insertBefore() to move their children up the tree before removing them. That way, removing those nodes later has no side effects.

So I added another generalized stripTags method to this child class for DOMDocument.

```
public function stripTags(DOMNode $node) {
    $change = $remove = [];
    
    /* Walk the entire tree to build a list of things that need removed */
    foreach($this->walk($node) as $n) {
        if ($n instanceof DOMText || $n instanceof DOMDocument) {
            continue;
        }
        $this->stripAttributes($n); // strips all node attributes not allowed
        $this->forceAttributes($n); // forces any required attributes
        if (!in_array($n->nodeName, $this->allowedTags, true)) {
            // track the disallowed node for removal
            $remove[] = $n;
            // we take all of its child nodes for modification later
            foreach($n->childNodes as $child) {
                $change[] = [$child, $n];
            }
        }
    }
    
    /* Go through the list of changes first so we don't break the
       referential integrity of the tree */
    foreach($change as list($a, $b)) {
        $b->parentNode->insertBefore($a, $b);
    }

    /* Now we can safely remove the old nodes */
    foreach($remove as $a) {
        if ($a->parentNode) {
            $a->parentNode->removeChild($a);
        }
    }
}
```

The trick here is that insertBefore() moves a node that is already in the tree rather than copying it, so moving the children (i.e. text nodes) of a disallowed tag up to its parent while we are still walking could easily break the traversal. This confused me a lot at first, but looking at the way the method works, it makes sense. Deferring the moves until the walk has finished makes sure we don't break a parentNode reference, for example when a deeper node is allowed but its parent is not in the allowed tags list.
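The move-then-remove step can be seen in isolation with a single element. This is a minimal sketch, not the class method: unwrapNode is a hypothetical helper and the markup is made up. It relies only on the fact that insertBefore() relocates an existing node:

```php
<?php
// Hypothetical helper: replace an element with its own children.
function unwrapNode(DOMNode $node): void {
    $parent = $node->parentNode;
    // Each insertBefore() call moves the first child out of $node,
    // so the loop ends once $node is empty.
    while ($node->firstChild) {
        $parent->insertBefore($node->firstChild, $node);
    }
    $parent->removeChild($node);
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<p>Hello <span>there <em>PHP</em></span>!</p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Unwrap the (disallowed) span; its text and the nested em stay in place.
unwrapNode($dom->getElementsByTagName('span')->item(0));

$result = trim($dom->saveHTML());
echo $result, "\n"; // <p>Hello there <em>PHP</em>!</p>
```

Because the children are inserted immediately before the element they came from, their order relative to the surrounding text is preserved, which avoids the misplaced-text problem described in the question.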

Complete Solution

Here's the complete solution I came up with to solve this problem more generally. I'll include it in my answer since I struggled to find coverage elsewhere of many of the edge cases in doing this with DOMDocument. It lets you specify which tags to allow, and all other tags are removed. It also lets you specify which attributes are allowed, with all other attributes removed, and it can force certain attributes on certain tags.

```
class HTMLFixer extends DOMDocument {
    protected static $defaultAllowedTags = [
        'p',
        'h1',
        'h2',
        'h3',
        'h4',
        'h5',
        'h6',
        'pre',
        'code',
        'blockquote',
        'q',
        'strong',
        'em',
        'del',
        'img',
        'a',
        'table',
        'thead',
        'tbody',
        'tfoot',
        'tr',
        'th',
        'td',
        'ul',
        'ol',
        'li',
    ];
    protected static $defaultAllowedAttributes = [
        'a'   => ['href'],
        'img' => ['src'],
        'pre' => ['class'],
    ];
    protected static $defaultForceAttributes = [
        'a' => ['target' => '_blank'],
    ];

    protected $allowedTags       = [];
    protected $allowedAttributes = [];
    protected $forceAttributes   = [];

    public function __construct($version = null, $encoding = null, $allowedTags = [],
                                $allowedAttributes = [], $forceAttributes = []) {
        $this->setAllowedTags($allowedTags ?: static::$defaultAllowedTags);
        $this->setAllowedAttributes($allowedAttributes ?: static::$defaultAllowedAttributes);
        $this->setForceAttributes($forceAttributes ?: static::$defaultForceAttributes);
        // Fall back to DOMDocument's own defaults; passing null is deprecated in PHP 8.1+
        parent::__construct($version ?? '1.0', $encoding ?? '');
    }

    public function setAllowedTags(Array $tags) {
        $this->allowedTags = $tags;
    }

    public function setAllowedAttributes(Array $attributes) {
        $this->allowedAttributes = $attributes;
    }

    public function setForceAttributes(Array $attributes) {
        $this->forceAttributes = $attributes;
    }

    public function getAllowedTags() {
        return $this->allowedTags;
    }

    public function getAllowedAttributes() {
        return $this->allowedAttributes;
    }

    public function getForceAttributes() {
        return $this->forceAttributes;
    }

    // Sanitizes the given node (or the whole document) before serializing it.
    // The attribute below silences the PHP 8.1+ return type notice; PHP 7 reads it as a comment.
    #[\ReturnTypeWillChange]
    public function saveHTML(DOMNode $node = null) {
        if (!$node) {
            $node = $this;
        }
        $this->stripTags($node);
        return parent::saveHTML($node);
    }

    protected function stripTags(DOMNode $node) {
        $change = $remove = [];
        // First pass: clean attributes and record what has to move or go
        foreach($this->walk($node) as $n) {
            if ($n instanceof DOMText || $n instanceof DOMDocument) {
                continue;
            }
            $this->stripAttributes($n);
            $this->forceAttributes($n);
            if (!in_array($n->nodeName, $this->allowedTags, true)) {
                $remove[] = $n;
                foreach($n->childNodes as $child) {
                    $change[] = [$child, $n];
                }
            }
        }
        // Second pass: move children of disallowed nodes up, in document order
        foreach($change as list($a, $b)) {
            $b->parentNode->insertBefore($a, $b);
        }
        // Third pass: remove the now-empty disallowed nodes
        foreach($remove as $a) {
            if ($a->parentNode) {
                $a->parentNode->removeChild($a);
            }
        }
    }

    protected function stripAttributes(DOMNode $node) {
        // Comments and other non-element nodes have no attribute map
        if (!$node->hasAttributes()) {
            return;
        }
        $attributes = $node->attributes;
        $len = $attributes->length;
        // Iterate backwards because removing an attribute shrinks the map
        for ($i = $len - 1; $i >= 0; $i--) {
            $attr = $attributes->item($i);
            if (!isset($this->allowedAttributes[$node->nodeName]) ||
                !in_array($attr->name, $this->allowedAttributes[$node->nodeName], true)) {
                $node->removeAttributeNode($attr);
            }
        }
    }

    protected function forceAttributes(DOMNode $node) {
        if (isset($this->forceAttributes[$node->nodeName])) {
            foreach ($this->forceAttributes[$node->nodeName] as $attribute => $value) {
                $node->setAttribute($attribute, $value);
            }
        }
    }

    protected function walk(DOMNode $node, $skipParent = false) {
        if (!$skipParent) {
            yield $node;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $n) {
                yield from $this->walk($n);
            }
        }
    }
}
```

So if we have the following HTML


  Some text...
  Hello PHP!

And we only want to allow <p> and <em>.

```
$html = <<<'HTML'
  Some text...
  Hello PHP!
HTML;

$dom = new HTMLFixer(null, null, ['p', 'em']);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
echo $dom->saveHTML($dom);
```

We'd get something like this...

Some text...
Hello PHP!

Since saveHTML() accepts a node, you can also limit this to a specific subtree in the DOM, so the solution could be generalized even more.
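The class filters attributes by name only, while the question also asks about constraining attribute values. One possible extension is sketched below as standalone functions. Both isAllowedUrl and dropUnsafeLinks are hypothetical helpers, not part of the class above, and the scheme list is an assumption you would adjust to your own needs:

```php
<?php
// Hypothetical helper: true when the URL is relative or uses an allowed scheme.
function isAllowedUrl(string $url, array $schemes = ['http', 'https', 'mailto']): bool {
    // For the check only, drop whitespace and control characters that could
    // be used to disguise a scheme such as "java\tscript:".
    $clean = preg_replace('/[\x00-\x20]+/', '', $url);
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $clean, $m)) {
        return in_array(strtolower($m[1]), $schemes, true);
    }
    return true; // no scheme found: treat as a relative URL
}

// Hypothetical helper: removes href attributes whose value fails the check.
function dropUnsafeLinks(DOMDocument $dom): void {
    foreach ($dom->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href') && !isAllowedUrl($a->getAttribute('href'))) {
            $a->removeAttribute('href');
        }
    }
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<p><a href="javascript:alert(1)">x</a> <a href="https://example.com/">y</a></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
dropUnsafeLinks($dom);

$cleaned = trim($dom->saveHTML());
echo $cleaned, "\n"; // <p><a>x</a> <a href="https://example.com/">y</a></p>
```

Inside the class, the same check could be called from stripAttributes() for attributes such as href and src, so that value rules live next to the name rules.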
