Reputation: 91
Please help with a regex to extract the link text from an a tag, but only if the tag has rel="external nofollow".
<a href="text.html" rel="external nofollow">text1:text2:text3/</a>
The only result I need is:
text1:text2:text3
When I try:
$regexp = '<a (?![^>]*?rel="external nofollow")[^>]*?href="(.*?)"';
I get this error:
Warning: preg_match() [function.preg-match]: Unknown modifier ']' in /
Upvotes: 1
Views: 909
Reputation: 70732
I suggest using DOM to parse the HTML and get your desired results. Here is an example:
<?php
$str = <<<STR
<a href="text.html" rel="external nofollow">foo bar</a>
<a href="text.html" rel="nofollow">text1:text2:text3/</a>
<a href="text.html" rel="nofollow">text1:text2:text3/</a>
<a href="example.html" rel="external nofollow">bar baz</a>
STR;
$dom = new DOMDocument;
$dom->loadHTML($str);
foreach ($dom->getElementsByTagName('a') as $node) {
    if ($node->getAttribute('rel') == 'external nofollow') {
        echo $node->getAttribute('href') . ', ' . $node->nodeValue . "\n";
    }
}
?>
Output from example:
text.html, foo bar
example.html, bar baz
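The comparison above matches only when the rel value is exactly 'external nofollow'. As a minimal sketch, assuming the tokens might appear in a different order or with extra whitespace, the value can be split into tokens and checked one by one. The helper name hasRelTokens and the sample markup are hypothetical, not part of the original answer.

```php
<?php
// Hypothetical helper: true when every required token appears in the rel value,
// regardless of order, case, or extra whitespace.
function hasRelTokens(string $rel, array $required): bool
{
    $tokens = preg_split('/\s+/', strtolower(trim($rel)), -1, PREG_SPLIT_NO_EMPTY);
    return count(array_diff($required, $tokens)) === 0;
}

// Sample markup (an assumption for illustration): tokens in reversed order.
$html = '<a href="a.html" rel="nofollow external">one</a>'
      . '<a href="b.html" rel="nofollow">two</a>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$matches = [];
foreach ($dom->getElementsByTagName('a') as $node) {
    if (hasRelTokens($node->getAttribute('rel'), ['external', 'nofollow'])) {
        $matches[] = $node->getAttribute('href');
    }
}
print_r($matches);
```

Whether a reordered rel value should count as a match depends on your input; if it is always written exactly as in the question, the plain comparison is enough.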
Upvotes: 3
Reputation: 785206
I strongly advise against using regex to parse HTML. HTML can vary a lot, and a regex can give you unexpected results.
Consider using the DOM parser in PHP, like this:
$html = '<a href="found.html" rel="external nofollow">text1:text2:text3/</a>
<a href="notfound.html" rel="external">text11/</a>';
$doc = new DOMDocument();
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//a[contains(@rel, 'external nofollow')]");
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    echo $node->getAttribute('href') . "\n";
}
OUTPUT:
found.html
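The question asks for the link text (text1:text2:text3) rather than the href. A minimal sketch of the same XPath approach adapted for that, assuming the trailing slash should be dropped from the result; the $texts variable is introduced here only for illustration:

```php
<?php
$html = '<a href="found.html" rel="external nofollow">text1:text2:text3/</a>
<a href="notfound.html" rel="external">text11/</a>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$texts = [];
foreach ($xpath->query("//a[contains(@rel, 'external nofollow')]") as $node) {
    // textContent is the text inside the tag; rtrim() removes the trailing "/"
    // (assumption: the slash is not wanted in the result).
    $texts[] = rtrim($node->textContent, '/');
}
echo implode("\n", $texts), "\n";
```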
Upvotes: 3
Reputation: 71538
First, you need proper delimiters around your regex; a suitable one here is ~:
$regexp = '~<a (?![^>]*?rel="external nofollow")[^>]*?href="(.*?)"~';
Second, this regex matches an anchor tag and captures the link in href only if there is no rel="external nofollow" in the tag, which I think is the opposite of what you're trying to do. A negative lookahead prevents the match when its pattern is found. You might want to change the regex completely to something like:
$regexp = '~<a[^>]*?rel="external nofollow"[^>]*>(.*?)</a>~';
Here the link text is captured in group 1.
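As for why the original error mentions ']': without explicit delimiters, PHP treats the leading < as an opening delimiter and the first > as the closing one, so the ] that follows it is read as a modifier. A minimal sketch showing both the failing pattern and the rewritten one; the variable names and the rtrim() step are assumptions for illustration:

```php
<?php
$html = '<a href="text.html" rel="external nofollow">text1:text2:text3/</a>';

// Original pattern without delimiters: compilation fails and preg_match()
// returns false (the warning is suppressed with @ for this demo only).
$noDelims = '<a (?![^>]*?rel="external nofollow")[^>]*?href="(.*?)"';
$broken = @preg_match($noDelims, $html);

// Rewritten pattern with ~ delimiters; group 1 holds the link text.
$regexp = '~<a[^>]*?rel="external nofollow"[^>]*>(.*?)</a>~';
$text = null;
if (preg_match($regexp, $html, $m)) {
    // Assumption: the trailing "/" is not wanted in the result.
    $text = rtrim($m[1], '/');
}

var_dump($broken);
echo $text, "\n";
```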
Upvotes: 0
Reputation: 28196
Try:
preg_match('~<a[^>]*rel="external nofollow"[^>]*>([^<]*)</a>~i',
$string_to_search_through, $res);
echo $res[1];
$res[1] will give you the desired text.
Upvotes: 1