Jason Axelrod

Reputation: 7805

Regular Expressions, avoiding HTML tags in PHP

I have seen questions like this quite a bit here, but none of them are exactly what I want. Let's say I have the following text:

Line 1 - This is a TEST phrase.
Line 2 - This is a <img src="TEST" /> image.
Line 3 - This is a <a href="somelink/TEST">TEST</a> link.

Okay, simple, right? I am trying the following code:

$linkPin = '#(\b)TEST(\b)(?![^<]*>)#i';
$linkRpl = '$1<a href="newurl">TEST</a>$2';

$html = preg_replace($linkPin, $linkRpl, $html);

As you can see, it takes the word TEST and wraps it in a link. The regular expression I am using right now works well for skipping the TEST in line 2, and it also skips the TEST in the href on line 3. However, it still replaces the text enclosed by the <a> tag on line 3, and I end up with:

Line 1 - This is a <a href="newurl">TEST</a> phrase.
Line 2 - This is a <img src="TEST" /> image.
Line 3 - This is a <a href="somelink/TEST"><a href="newurl">TEST</a></a> link.

I do not want this, as it produces invalid markup on line 3. I want to ignore matches not only inside a tag, but also in text enclosed by tags. (Keep in mind the /> in line 2.)

Upvotes: 0

Views: 373

Answers (2)

Jason Axelrod

Reputation: 7805

Okay... I think I came up with a better solution...

$noMatch = '(</a>|</h\d+>)';

$linkUrl = 'http://www.test.com/test/'.$link['page_slug'];
$linkPin = '#(?!(?:[^<]+>|[^>]+'.$noMatch.'))\b'.preg_quote($link['page_name'], '#').'\b#i';
$linkRpl = '<a href="'.$linkUrl.'">'.$link['page_name'].'</a>';

$page['HTML'] = preg_replace($linkPin, $linkRpl, $page['HTML']);

This code skips any text directly followed by a closing </a> or </h#> tag, so text inside <a> and <h#> elements is left alone. I figure any new exclusions simply need to be added to $noMatch.

Is there anything wrong with this method?
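
To check the pattern, here is a small self-contained run against the sample lines from the question. The values in $link are made up for illustration, and Line 4 is an extra case that is not in the question, added to see what happens when the word sits inside another element within the link.

```php
<?php
// Hypothetical link record: the keys follow the snippet above, the values are made up.
$link = ['page_slug' => 'test-page', 'page_name' => 'TEST'];

$html = <<<'HTML'
Line 1 - This is a TEST phrase.
Line 2 - This is a <img src="TEST" /> image.
Line 3 - This is a <a href="somelink/TEST">TEST</a> link.
Line 4 - This is a <a href="somelink"><b>TEST</b></a> nested link.
HTML;

$noMatch = '(</a>|</h\d+>)';

$linkUrl = 'http://www.test.com/test/' . $link['page_slug'];
// Passing the delimiter to preg_quote() keeps a "#" in the name from ending the pattern early.
$linkPin = '#(?!(?:[^<]+>|[^>]+' . $noMatch . '))\b' . preg_quote($link['page_name'], '#') . '\b#i';
$linkRpl = '<a href="' . $linkUrl . '">' . $link['page_name'] . '</a>';

$result = preg_replace($linkPin, $linkRpl, $html);
echo $result, "\n";
```

Lines 1 to 3 come out as wanted: only the TEST on line 1 is linked. On Line 4 the word is followed by </b> rather than </a>, so the lookahead does not exclude it and a link ends up nested inside the existing link. Adding more closing tags to $noMatch helps, but only for the tags listed there.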

Upvotes: 0

ircmaxell

Reputation: 165191

Honestly, I'd do this with DOMDocument and XPath:

//First, create a simple html string around the text.
$html = '<html><body><div id="content">'.$text.'</div></body></html>';

$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);

$query = '//*[not(name() = "a") and contains(., "TEST")]';
$nodes = $xpath->query($query);

//Force it to an array to break the reference so iterating works properly
$nodes = iterator_to_array($nodes); 
$replaceNode = function ($node) {
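    //nodeValue is entity-decoded, so escape it again because appendXML() expects well-formed XML
    $text = htmlspecialchars($node->nodeValue, ENT_NOQUOTES);
    $text = str_replace('TEST', '<a href="TEST">TEST</a>', $text);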
    $fragment = $node->ownerDocument->createDocumentFragment();
    $fragment->appendXML($text);
    $node->parentNode->replaceChild($fragment, $node);
};

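foreach ($nodes as $node) {
    if ($node instanceof DomText) {
        $replaceNode($node);
    } else {
        //Copy the live child list to an array, since replaceChild() modifies it
        foreach (iterator_to_array($node->childNodes) as $child) {
            //Only rewrite text children that actually contain the word
            if ($child instanceof DomText && strpos($child->nodeValue, 'TEST') !== false) {
                $replaceNode($child);
            }
        }
    }
}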

This should work for you, since it skips text that sits directly inside <a> elements and only replaces text that is a direct child of the other matching elements.
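
One case the query above does not cover is a word nested deeper inside a link, such as <a href="..."><b>TEST</b></a>, because the <b> element is not named "a" and therefore matches. A possible variant, sketched here under the assumption that any <a> ancestor should block the replacement, selects text nodes directly and filters on the ancestor axis. The function name linkWord and the sample input are hypothetical, and newurl is the placeholder from the question.

```php
<?php
// Hypothetical helper, not part of the answer above: links each occurrence
// of $word that has no <a> ancestor at any depth.
function linkWord($text, $word, $url)
{
    libxml_use_internal_errors(true); // keep parser warnings about odd markup quiet

    $dom = new DOMDocument();
    $dom->loadHTML('<html><body><div id="content">' . $text . '</div></body></html>');
    $xpath = new DOMXPath($dom);

    // Every text node without an <a> ancestor, copied to an array before the tree is modified.
    $nodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    $needle = htmlspecialchars($word, ENT_NOQUOTES);
    $anchor = '<a href="' . htmlspecialchars($url) . '">' . $needle . '</a>';

    foreach ($nodes as $node) {
        if (strpos($node->nodeValue, $word) === false) {
            continue; // nothing to replace in this text node
        }
        // nodeValue is entity-decoded, so escape it again before building XML.
        $xml = str_replace($needle, $anchor, htmlspecialchars($node->nodeValue, ENT_NOQUOTES));
        $fragment = $dom->createDocumentFragment();
        $fragment->appendXML($xml);
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($xpath->query('//div[@id="content"]')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$text = 'This is a TEST phrase. '
      . 'This is a <a href="somelink/TEST">TEST</a> link. '
      . 'This is a <a href="somelink"><b>TEST</b></a> nested link.';

$out = linkWord($text, 'TEST', 'newurl');
echo $out, "\n";
```

Only the first TEST should end up linked here. Note that saveHTML() re-serializes the markup, so the output may not match the input byte for byte (void elements, for example, may lose their trailing slash).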

Upvotes: 1
