Reputation: 29
I'm stuck trying to write a regular expression in PHP that matches A HREF tags using capturing groups.
My current code looks like this:
$content = preg_replace_callback(
'/<a[^>]*href=["|\']([^"|\']*)["|\'][^>]*>([^<]*)<\/a>/i',
function($m) {
...
The code works perfectly fine for anything like this:
<a href="/go/bla" rel="sponsored noopener" target="_blank">Test link</a>
But I have some URLs that look like this (note the nested <span></span>):
<a href="/go/bla" rel="sponsored noopener" target="_blank"><span>Test link</span></a>
My second capturing group matches [^<], which is why the nested version doesn't match. I was trying to change the group to match anything but </a>. That's where I failed, thanks to my lack of regex experience :)
Could any regex expert please point me in the right direction?
Upvotes: 1
Views: 69
Reputation: 2993
The following regex should help you:
<a[^>]*href=["|\']([^"|\']*)["|\'][^>]*>(?:<[^>]+>)*([^<]*)(?:<\/[^>]+>)*<\/a>
This will match your example as well as this example:
<a href="/go/bla" rel="sponsored noopener" target="_blank"><span><h1>Test link</h1></span></a>
However what about this?
<a href="/go/bla" rel="sponsored noopener" target="_blank"><span><h1>Test <span>link</span></h1></span></a>
Nope! This breaks. To keep matching when there are tags within tags, with text outside those tags, we would have to break the pattern up even further. At this stage it would be better to simply fetch a list of all a tags, and then perform some substitutions to extract the data you need after the fact.
$content = preg_replace_callback('/<a[^>]*?href="(.*?)"[^>]*?>(.*?)<\/a>/i', function($m) {
... more regexes
}, $content);
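To see where the first pattern in this answer stops working, here is a small check against the three sample links. It is only a sketch: the array keys and variable names are made up for illustration, and the slash in the closing-tag group is escaped so the pattern can sit inside / delimiters.

```php
<?php
// Pattern from this answer, with "/" escaped for use inside / delimiters.
$pattern = '/<a[^>]*href=["|\']([^"|\']*)["|\'][^>]*>(?:<[^>]+>)*([^<]*)(?:<\/[^>]+>)*<\/a>/i';

// The three sample links discussed above (labels are hypothetical).
$samples = [
    'plain'  => '<a href="/go/bla" rel="sponsored noopener" target="_blank">Test link</a>',
    'nested' => '<a href="/go/bla" rel="sponsored noopener" target="_blank"><span><h1>Test link</h1></span></a>',
    'mixed'  => '<a href="/go/bla" rel="sponsored noopener" target="_blank"><span><h1>Test <span>link</span></h1></span></a>',
];

$results = [];
foreach ($samples as $label => $html) {
    // preg_match returns 1 on a match and 0 otherwise.
    $results[$label] = preg_match($pattern, $html, $m) === 1 ? $m[2] : null;
}

foreach ($results as $label => $text) {
    echo $label, ': ', $text === null ? 'no match' : $text, "\n";
}
```

The 'mixed' sample is the one that fails, because the text is split by an inner tag and the second group can only capture a single run of text.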
It may be better to consider using a library that allows you to load HTML content as objects (much like a browser would) and query your results using something like XPath.
In PHP you can use DOMDocument and DOMXPath to load and query HTML. Below is an example.
$doc = new DOMDocument();
$html = <<<EOD
<html>
<body>
<a href="/go/bla" rel="sponsored noopener" target="_blank">Test link</a>
<a href="/go/bla" rel="sponsored noopener" target="_blank"><span>Test link</span></a>
<a href="/go/bla" rel="sponsored noopener" target="_blank"><span><h1>Test <span>link</span></h1></span></a>
</body>
</html>
EOD;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = $xpath->query('//a');
if ($query !== false) { // query() returns false for an invalid expression
foreach ($query as $q) {
print $q->getAttribute('href') . ' - ';
print $q->nodeValue . "\n";
}
}
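Since the question is about rewriting links rather than just listing them, the same DOM approach can also replace the preg_replace_callback call. The following is a minimal sketch, not a drop-in solution: rewriteLinks, the wrapper div and the '/tracked' prefix are hypothetical, because the body of the original callback is not shown.

```php
<?php
// Hypothetical helper: passes every link's href and text to $rewrite and
// stores the returned value as the new href.
function rewriteLinks(string $html, callable $rewrite): string
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so its nodes can be found again after parsing.
    $doc->loadHTML('<div id="wrapper">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $a) {
        // textContent is the combined text of all nested elements.
        $a->setAttribute('href', $rewrite($a->getAttribute('href'), $a->textContent));
    }

    // Serialize only the children of the wrapper, not the added html/body.
    $out = '';
    $wrapper = $xpath->query('//div[@id="wrapper"]')->item(0);
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = '<a href="/go/bla" rel="sponsored noopener" target="_blank"><span>Test link</span></a>';
$seen = [];
$result = rewriteLinks($html, function (string $href, string $text) use (&$seen) {
    $seen[] = $text;
    return '/tracked' . $href; // placeholder transformation
});
echo $result, "\n";
```

Nested markup inside the link is preserved, and the callback still receives the plain link text no matter how deeply it is wrapped.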
Upvotes: 0
Reputation: 4288
This should be sufficient for your example:
<a[^>]*href=["|\']([^"|\']*)["|\'][^>]*>(?:<[^>]+>)?([^<]*)(?:<[^>]+>)?<\/a>
Adding (?:<[^>]+>)? on each side of the text group matches the extra tags if they exist.
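As a rough illustration, this is how the pattern could be plugged into the preg_replace_callback call from the question. The $found array and the callback body are assumptions made for the example, since the original callback is elided.

```php
<?php
// Pattern from this answer, wrapped in / delimiters with the i flag.
$pattern = '/<a[^>]*href=["|\']([^"|\']*)["|\'][^>]*>(?:<[^>]+>)?([^<]*)(?:<[^>]+>)?<\/a>/i';

$content = '<p><a href="/go/bla" rel="sponsored noopener" target="_blank"><span>Test link</span></a></p>';

$found = [];
$content = preg_replace_callback($pattern, function ($m) use (&$found) {
    // $m[1] is the href value, $m[2] the link text without the wrapper tag.
    $found[] = ['href' => $m[1], 'text' => $m[2]];
    return $m[0]; // placeholder: return the link unchanged
}, $content);

foreach ($found as $link) {
    echo $link['href'], ' - ', $link['text'], "\n";
}
```

This handles exactly one optional wrapper tag on each side of the text; deeper nesting will not match. Also note that inside a character class | is a literal pipe rather than alternation, so ["\'] would be enough.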
Upvotes: 2