Reputation: 25
I'm trying to use preg_match_all to scan the source of a page and pull all links that are mailto: links into one array and all links that are not mailto: links into another array. Currently I'm using:
$searches = array('reg'=>'/href(=|=\'|=\")(?!mailto)(.+)\"/i','mailto'=>'/href(=|=\'|=\")(?=mailto)(.+)\"/i');
foreach ($searches as $key=>$search)
{
    preg_match_all($search,$source,$found[$key]);
}
The mailto: link search is working perfectly, but I can't figure out why the non-mailto: link search is pulling both mailto: and non-mailto: links, even with the negative lookahead assertion in place. What am I doing wrong?
Upvotes: 1
Views: 563
Reputation: 490413
A saner solution that isn't so fragile would be to use DOMDocument...
$dom = new DOMDocument;
$dom->loadHTML($html);
$mailLinks = $nonMailLinks = array();
// Walk every <a> element and sort its href into one of the two arrays.
$a = $dom->getElementsByTagName('a');
foreach($a as $anchor) {
    if ($anchor->hasAttribute('href')) {
        $href = trim($anchor->getAttribute('href'));
        // 'mailto:' is 7 characters long; this comparison is case-sensitive.
        if (substr($href, 0, 7) == 'mailto:') {
            $mailLinks[] = $href;
        } else {
            $nonMailLinks[] = $href;
        }
    }
}
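A variation on the loop above, offered as a sketch rather than the definitive approach: real-world markup often makes loadHTML() emit warnings, and a scheme written as MAILTO: would slip past a case-sensitive substr() check. The helper name splitLinks() and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper; the name splitLinks() is not from the answer above.
function splitLinks($html)
{
    $links = array('mailto' => array(), 'reg' => array());

    // Collect parser warnings internally instead of emitting them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        if (!$anchor->hasAttribute('href')) {
            continue; // named anchors without an href are skipped
        }
        $href = trim($anchor->getAttribute('href'));
        // strncasecmp() compares the first 7 bytes case-insensitively.
        $key = (strncasecmp($href, 'mailto:', 7) === 0) ? 'mailto' : 'reg';
        $links[$key][] = $href;
    }
    return $links;
}

// Made-up sample input for illustration.
$links = splitLinks('<p><a href="MAILTO:me@example.com">Mail</a> <a href="/about">About</a> <a name="top">No href</a></p>');
```

Here $links['mailto'] collects the mail links and $links['reg'] the rest, mirroring the two keys used in the question.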
Upvotes: 2
Reputation: 145482
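Your regex tries the alternatives in order here, so the bare = is matched first:
(=|=\'|=\")
You either need to sort that = last, or use the more common:
=[\'\"]?
Either way, also exchange the .+ for the more explicit/restrictive [^\'\">]+ because reordering alone is not enough: the engine can still backtrack to the bare =, at which point the lookahead sees '"mailto:' (which does not start with mailto) and succeeds, and .+ then matches the quote and everything after it. With the character class, the capture can no longer start with a quote, so the negative assertion cannot be sidestepped.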
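As a rough sketch of how these suggestions might be combined: the sample $source string is made up for illustration, and only capture group 1 is stored, which differs slightly from the $found layout in the question.

```php
<?php
// Made-up sample input for illustration; not from the original question.
$source = '<a href="mailto:me@example.com">Mail</a> '
        . '<a href="http://example.com/">Site</a> '
        . '<a href=\'page.html\'>Page</a>';

$searches = array(
    // Optional quote, refuse "mailto", then capture up to the next quote or ">".
    'reg'    => '/href=[\'"]?(?!mailto)([^\'">]+)/i',
    // Same shape, but require "mailto" right after the optional quote.
    'mailto' => '/href=[\'"]?(?=mailto)([^\'">]+)/i',
);

$found = array();
foreach ($searches as $key => $search) {
    preg_match_all($search, $source, $matches);
    $found[$key] = $matches[1]; // keep only the captured URLs
}
```

Because the capture group cannot begin with a quote, backtracking to "no quote consumed" leaves nothing for it to match, so a mailto: link is rejected by the 'reg' pattern instead of slipping through. Unquoted attribute values followed by other attributes would need whitespace excluded from the character class as well.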
Upvotes: 0