mtylerb

Reputation: 25

Use preg_match_all to pull all <a href links that are NOT mailto: links

I'm trying to use preg_match_all to scan the source of a page and pull all links that are mailto: links into one array and all links that are not mailto: links into another array. Currently I'm using:

$searches = array('reg'=>'/href(=|=\'|=\")(?!mailto)(.+)\"/i','mailto'=>'/href(=|=\'|=\")(?=mailto)(.+)\"/i');
foreach ($searches as $key=>$search)
{
    preg_match_all($search,$source,$found[$key]);
}

The mailto: links search is working perfectly, but I can't find the reason why the non mailto: link search is pulling both mailto: and non-mailto: links, even with the negative look ahead assertion in place. What am I doing wrong?

Upvotes: 1

Views: 563

Answers (2)

alex

Reputation: 490413

A saner solution that isn't so fragile would be to use DOMDocument...

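$dom = new DOMDocument;
$dom->loadHTML($html); // $html holds the page source

$mailLinks = $nonMailLinks = array();

// Walk every <a> element and sort its href into one of the two arrays.
$a = $dom->getElementsByTagName('a');

foreach ($a as $anchor) {
    if ($anchor->hasAttribute('href')) {
        $href = trim($anchor->getAttribute('href'));
        // URL schemes are case-insensitive, so "MAILTO:" counts as well.
        if (strncasecmp($href, 'mailto:', 7) === 0) {
            $mailLinks[] = $href;
        } else {
            $nonMailLinks[] = $href;
        }
    }
}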

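A minimal sketch of the same DOMDocument approach wrapped in a function, in case the page source is not well-formed HTML. The function name splitLinks, the array keys, and the sample markup are hypothetical and only for illustration; libxml_use_internal_errors() is used here on the assumption that you would rather collect parser warnings than have them printed.

```php
<?php
// Hypothetical helper; not part of the original answer.
function splitLinks($html)
{
    $links = array('mailto' => array(), 'reg' => array());

    // Collect parser warnings for malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        if (!$anchor->hasAttribute('href')) {
            continue; // e.g. <a name="..."> anchors
        }
        $href = trim($anchor->getAttribute('href'));
        // Compare the scheme without regard to case.
        $key = (strncasecmp($href, 'mailto:', 7) === 0) ? 'mailto' : 'reg';
        $links[$key][] = $href;
    }

    return $links;
}

// Sample markup, made up for this example.
$links = splitLinks(
    '<p><a href="MAILTO:me@example.com">Mail</a> '
    . '<a href="/about">About</a> <a name="x">No href</a></p>'
);
print_r($links);
```

Returning both lists from one call keeps the parsing and the sorting in one place, so the calling code does not need to touch the DOM at all.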

Upvotes: 2

mario

Reputation: 145482

Your regex tries the alternatives from left to right and takes the first one that matches, which here is the bare =:

 (=|=\'|=\")

You either need to move that bare = to the end of the alternation, or use the more common:

 =[\'\"]?

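Alternatively, or better in addition, exchange the .+ for the more explicit and restrictive [^\'\">]+. As it stands, the negative assertion is tested right after the bare =, where the next characters are "mailto: rather than mailto:, so it passes and .+ goes on to match the quoted mailto: link. A value group that cannot match a quote closes that loophole, even when the engine backtracks to the bare =.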
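Putting both suggestions together, a corrected pair of patterns could look like the sketch below. The sample $source string is made up for illustration, and excluding whitespace from the value (\s in the character class) is an extra assumption so that unquoted attributes stop at the next space.

```php
<?php
// Hypothetical sample input; not from the original question.
$source = '<a href="mailto:me@example.com">Mail</a> '
        . '<a href="http://example.com/">Site</a> '
        . "<a href='/about'>About</a>";

$searches = array(
    // The optional quote is consumed before the assertion is tested, and the
    // value group cannot match a quote, so backtracking cannot sneak past it.
    'reg'    => '/href=[\'"]?(?!mailto:)([^\'">\s]+)/i',
    'mailto' => '/href=[\'"]?(?=mailto:)([^\'">\s]+)/i',
);

$found = array();
foreach ($searches as $key => $search) {
    preg_match_all($search, $source, $matches);
    $found[$key] = $matches[1]; // capture group 1 holds the link target
}

print_r($found);
```

With this sample input, $found['reg'] holds http://example.com/ and /about, while $found['mailto'] holds only mailto:me@example.com.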

Upvotes: 0
