user14343548

Reputation: 43

How would I modify an HTML string without touching the HTML elements?

Suppose I have this string:

$test = '<p>You are such a <strong class="Stack">helpful</strong> Stack Exchange user.</p>';

If I then naively replace every instance of "Stack" with "Flack", I get this:

$test = '<p>You are such a <strong class="Flack">helpful</strong> Flack Exchange user.</p>';

Clearly, I did not want this. I only wanted to change the actual "content" -- not the HTML parts. I want this:

$test = '<p>You are such a <strong class="Stack">helpful</strong> Flack Exchange user.</p>';

For that to be possible, there has to be some kind of intelligent parsing going on. Something which first detects and picks out the HTML elements from the string, then makes the string replacement operation on the "pure" content string, and then somehow puts the HTML elements back, intact, in the right places.

My brain has been wrestling with this for quite some time now and I can't find any reasonable solution which wouldn't be hackish and error-prone.

It strikes me that this might exist as a feature built into PHP. Is that the case? Or is there some way I could accomplish this in a robust and sane way?

I would rather not try to replace all HTML parts with ____DO_NOT_TOUCH_1____, ____DO_NOT_TOUCH_2____, etc. It doesn't seem like the right way.

Upvotes: 3

Views: 143

Answers (1)

Benni

Reputation: 1033

You can do it as suggested by @04FS, with the following recursive function:

function replaceText(DOMNode $node, string $search, string $replace): void {
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                // Text node: this is actual content, so apply the replacement.
                $child->textContent = str_replace($search, $replace, $child->textContent);
            } else {
                // Any other node: leave tag names and attributes alone, recurse into children.
                replaceText($child, $search, $replace);
            }
        }
    }
}

Because DOMDocument extends DOMNode, you can pass the document itself as the function argument:

$html =
    '<div class="foo">
        <span class="foo">foo</span>
        <span class="foo">foo</span>
        foo
    </div>';

$doc = new DOMDocument();
$doc->loadXML($html); // needs well-formed XML; loadHTML() is more lenient but emits warnings for tags it does not recognize

replaceText($doc, 'foo', 'bar');

echo $doc->saveXML();
// or
echo $doc->saveXML($doc->firstChild);
// ... to get rid of the leading XML declaration

This outputs:

<div class="foo">
    <span class="foo">bar</span>
    <span class="foo">bar</span>
    bar
</div>
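
Applied to the string from the question, this could look roughly like the sketch below. It assumes the fragment is loaded with loadHTML() plus the LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD flags, so that no doctype or html/body wrapper is added around the single <p> element. The function is repeated only so the snippet runs on its own.

```php
<?php
// Same recursive helper as above, repeated so this snippet is self-contained.
function replaceText(DOMNode $node, string $search, string $replace): void {
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $child->textContent = str_replace($search, $replace, $child->textContent);
            } else {
                replaceText($child, $search, $replace);
            }
        }
    }
}

$test = '<p>You are such a <strong class="Stack">helpful</strong> Stack Exchange user.</p>';

$doc = new DOMDocument();
// Assumption: these flags keep libxml from wrapping the fragment in <html><body>
// and from adding a doctype; they rely on the fragment having one root element.
$doc->loadHTML($test, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

replaceText($doc, 'Stack', 'Flack');

// saveHTML() serializes the document back to an HTML string.
$result = trim($doc->saveHTML());
echo $result, PHP_EOL;
```

The class="Stack" attribute should survive untouched, while "Stack Exchange" in the text becomes "Flack Exchange". Note that loadHTML() does not assume UTF-8 by default, so input with non-ASCII characters needs its encoding declared.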

Bonus: When you want to str_replace an attribute value

function replaceTextInAttribute(DOMNode $node, string $attribute_name, string $search, string $replace): void {
    // Only element nodes carry attributes; for other node types this is skipped.
    if ($node->hasAttributes()) {
        foreach ($node->attributes as $attr) {
            if ($attr->nodeName === $attribute_name) {
                $attr->nodeValue = str_replace($search, $replace, $attr->nodeValue);
            }
        }
    }
    // Walk the whole subtree so nested elements are handled as well.
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            replaceTextInAttribute($child, $attribute_name, $search, $replace);
        }
    }
}
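
A possible usage of the attribute variant, as a minimal sketch: the markup and the replacement value 'baz' are made up for illustration, and the function is repeated so the snippet runs standalone. Only the class attributes are rewritten; the id attribute and the text content keep their "foo".

```php
<?php
// Same function as above, repeated so this snippet is self-contained.
function replaceTextInAttribute(DOMNode $node, string $attribute_name, string $search, string $replace): void {
    if ($node->hasAttributes()) {
        foreach ($node->attributes as $attr) {
            if ($attr->nodeName === $attribute_name) {
                $attr->nodeValue = str_replace($search, $replace, $attr->nodeValue);
            }
        }
    }
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            replaceTextInAttribute($child, $attribute_name, $search, $replace);
        }
    }
}

// Illustrative input: "foo" appears in two attributes and in the text.
$doc = new DOMDocument();
$doc->loadXML('<div class="foo" id="foo"><span class="foo">foo</span></div>');

// Replace "foo" with "baz", but only inside class attributes.
replaceTextInAttribute($doc, 'class', 'foo', 'baz');

$out = $doc->saveXML($doc->documentElement);
echo $out, PHP_EOL;
// <div class="baz" id="foo"><span class="baz">foo</span></div>
```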

Bonus 2: Make the function more extensible

function modifyText(DOMNode $node, callable $userFunc): void {
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                // Let the callback decide how the text content is transformed.
                $child->textContent = $userFunc($child->textContent);
            } else {
                modifyText($child, $userFunc);
            }
        }
    }
}

modifyText(
    $doc, 
    function(string $string) {
        return strtoupper(str_replace('foo', 'bar', $string));
    }
);

echo $doc->saveXML($doc->firstChild);

This outputs:

<div class="foo">
    <span class="foo">BAR</span>
    <span class="foo">BAR</span>
    BAR
</div>
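
As an alternative to the recursion (not part of the approach above, just another way to reach the same text nodes), an XPath query can select them directly. This is a sketch under assumptions: the function name replaceTextXPath and its optional $expression parameter are hypothetical, and the default expression //text() selects every text node in the document.

```php
<?php
// Hypothetical helper: replaces text in all nodes matched by an XPath expression.
function replaceTextXPath(
    DOMDocument $doc,
    string $search,
    string $replace,
    string $expression = '//text()'
): void {
    $xpath = new DOMXPath($doc);
    // query() returns the matching nodes; attributes and tag names are never matched by text().
    foreach ($xpath->query($expression) as $textNode) {
        $textNode->nodeValue = str_replace($search, $replace, $textNode->nodeValue);
    }
}

$doc = new DOMDocument();
$doc->loadXML('<div class="foo"><span class="foo">foo</span> foo</div>');

replaceTextXPath($doc, 'foo', 'bar');

$out = $doc->saveXML($doc->documentElement);
echo $out, PHP_EOL;
// <div class="foo"><span class="foo">bar</span> bar</div>
```

The expression can be narrowed if some text should be left alone, for example //text()[not(ancestor::script)] to skip text inside script elements.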

Upvotes: 4
