jmadsen

Reputation: 3675

regex to replace mailto: hrefs but ignore site links

I need some help to tweak this regular expression:

$content = 'more <a href="http://www.test.com">test</a> test <a href="mailto:[email protected]">Jeff</a> this is a <a href="http://www.test.com">test</a>';

$content = preg_replace("~<a .*?href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>~", "$1", $content); 

This expression is meant to strip the HTML markup from a mailto link and return just the email address ([email protected]).

It works fine except in the example I gave above: because an unlimited number of whitespace characters is allowed before the href in the pattern, when a website link comes before the mailto link, the regex looks all the way forward until it finds the mailto: in the following link and removes all the content in between.

Maybe a fix would be to limit it to two or three whitespace characters after the opening tag so that it doesn't look so far ahead, but I wonder whether people who know regex better than I do have a better solution.

Upvotes: 1

Views: 2569

Answers (2)

alex

Reputation: 490403

You should be using a DOM parser for this rather than a regex:

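$dom = new DOMDocument;
$dom->loadHTML($content);

// Copy the live node list into an array first: removing nodes while
// iterating over the list itself can cause elements to be skipped.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
    if ($a->hasAttribute('href')
        && strpos($href = trim($a->getAttribute('href')), 'mailto:') === 0) {
        // Swap the link for a text node holding the address
        // ("mailto:" is 7 characters long).
        $textNode = $dom->createTextNode(substr($href, 7));
        $parent = $a->parentNode;
        $parent->insertBefore($textNode, $a);
        $parent->removeChild($a);
    }
}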


$dom->saveHTML() adds all the HTML boilerplate, such as the html and body elements. You can remove them with...

$html = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
    $html .= $dom->saveHTML($node);
}
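Putting the two snippets above together, a minimal end-to-end sketch might look like the following. The function name stripMailtoLinks and the example.com addresses are made up for illustration, and the sketch assumes $content is a non-empty HTML fragment. It uses replaceChild() in place of the insertBefore()/removeChild() pair, and stripos() so that "MAILTO:" is matched too; both are optional variations.

```php
<?php
// Hypothetical helper (name not from the original answer): replaces every
// <a href="mailto:..."> element with its bare address and returns the HTML
// found inside <body>. Assumes $content is a non-empty HTML fragment.
function stripMailtoLinks(string $content): string
{
    $dom = new DOMDocument;

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Snapshot the live node list so replacing nodes does not skip elements.
    foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
        $href = trim($a->getAttribute('href')); // '' when href is missing
        if (stripos($href, 'mailto:') === 0) {
            // Drop the 7-character "mailto:" prefix and keep the address.
            $a->parentNode->replaceChild($dom->createTextNode(substr($href, 7)), $a);
        }
    }

    // Serialize only the children of <body>, skipping the added boilerplate.
    $html = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
        $html .= $dom->saveHTML($node);
    }
    return $html;
}

$out = stripMailtoLinks(
    'more <a href="http://www.test.com">test</a> test '
    . '<a href="mailto:jeff@example.com">Jeff</a> this is a '
    . '<a href="http://www.test.com">test</a>'
);
echo $out, "\n";
```

Note that the parser may wrap loose text in an extra element such as a paragraph, so the returned markup is not guaranteed to be byte-for-byte identical to the input outside the replaced links.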

CodePad.

Upvotes: 6

stema

Reputation: 93026

The problem is not that you allow any amount of whitespace; that would work. The problem is that your <a .*? allows one space followed by any number of ANY characters, so the match can run past the end of the first tag and into the next link.

If you fix this and really allow only whitespace, like this

<a\s+href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>

it seems to work.
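To make the difference concrete, here is a small sketch that runs the original pattern and the whitespace-only pattern against the sample string from the question. The address jeff@example.com is a placeholder, not the one from the post. The third pattern, using [^>]*, is an extra variant that is not part of the suggestion above: it keeps the match inside a single opening tag while still allowing other attributes before href.

```php
<?php
// Sample input from the question; the address is a placeholder.
$content = 'more <a href="http://www.test.com">test</a> test '
    . '<a href="mailto:jeff@example.com">Jeff</a> this is a '
    . '<a href="http://www.test.com">test</a>';

// Original pattern: "<a .*?" may run past the end of the first tag,
// so the match starts at the first link and ends at the mailto link.
$greedy = preg_replace('~<a .*?href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>~', '$1', $content);

// Whitespace-only pattern: href must directly follow "<a" plus whitespace.
$strict = preg_replace('~<a\s+href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>~', '$1', $content);

// Variant (an assumption, not from the answer): [^>]* cannot cross a ">",
// so other attributes may precede href without leaving the opening tag.
$scoped = preg_replace('~<a\s[^>]*?href=[\'"]mailto:([^\'"]*)[\'"][^>]*>.*?</a>~', '$1', $content);

echo $greedy, "\n"; // the first site link is swallowed along with the mailto link
echo $strict, "\n"; // only the mailto link is replaced
echo $scoped, "\n"; // same result as $strict on this input
```

The [^>]* variant still assumes that no attribute value contains a literal ">", so it remains a heuristic rather than a real HTML parser.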


But you should probably take a closer look at alex's answer (+1 for the example), as that would be the cleaner solution.

Upvotes: 1
