Valdemar Z

Reputation: 91

Regex extract from tag a href only if rel=

Please help me with a regex that extracts data from an <a href> tag only if the tag has rel="external nofollow". Given:

<a href="text.html" rel="external nofollow">text1:text2:text3/</a>

I only need to get this as the result:

text1:text2:text3

When I try

$regexp = '<a (?![^>]*?rel="external nofollow")[^>]*?href="(.*?)"';

I get this error:

Warning: preg_match() [function.preg-match]: Unknown modifier ']' in /

Upvotes: 1

Views: 909

Answers (4)

hwnd

Reputation: 70732

I suggest that you use DOM to parse and get your desired results. Below is an example for this.

<?php
$str = <<<STR
<a href="text.html" rel="external nofollow">foo bar</a>
<a href="text.html" rel="nofollow">text1:text2:text3/</a>
<a href="text.html" rel="nofollow">text1:text2:text3/</a>
<a href="example.html" rel="external nofollow">bar baz</a>
STR;

$dom = new DOMDocument;
$dom->loadHTML($str);

foreach ($dom->getElementsByTagName('a') as $node) {
    if ($node->getAttribute('rel') == 'external nofollow') {
        echo $node->getAttribute('href') . ', ' . $node->nodeValue . "\n";
    }
}
?>

Output from example:

text.html, foo bar
example.html, bar baz
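
Applied to the markup from the question, the same idea might look like the sketch below. It assumes the trailing slash in the link text is unwanted, since the desired result in the question omits it; the second link and the `$texts` array are only there for illustration.

```php
<?php
// Sample markup from the question, plus one link that should be skipped.
$html = '<a href="text.html" rel="external nofollow">text1:text2:text3/</a>'
      . '<a href="other.html" rel="nofollow">skipped/</a>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$texts = [];
foreach ($dom->getElementsByTagName('a') as $node) {
    if ($node->getAttribute('rel') === 'external nofollow') {
        // Assumption: the trailing "/" is not wanted, so rtrim() removes it.
        $texts[] = rtrim($node->textContent, '/');
    }
}

print_r($texts);
```

Here `$texts` should end up holding the single string text1:text2:text3.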

Upvotes: 3

anubhava

Reputation: 785206

I strongly advise against using regex to parse HTML. HTML markup can vary a lot (attribute order, quoting style, whitespace), so a regex can easily give unexpected results.

Consider using PHP's DOM parser instead, like this:

$html = '<a href="found.html" rel="external nofollow">text1:text2:text3/</a>
         <a href="notfound.html" rel="external">text11/</a>';
$doc = new DOMDocument();
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//a[contains(@rel, 'external nofollow')]");
for($i=0; $i < $nodelist->length; $i++) {
   $node = $nodelist->item($i);
   echo $node->getAttribute('href') . "\n";
}

OUTPUT:

found.html
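
Note that contains(@rel, 'external nofollow') is a plain substring test, so it depends on the two tokens appearing in exactly that order. If the order might vary, one possible variation (a sketch, not part of the original answer) is to test each token separately with standard XPath 1.0 functions. The sample links and the `$rel`, `$query` and `$hrefs` names are made up for illustration.

```php
<?php
$html = '<a href="a.html" rel="external nofollow">one/</a>'
      . '<a href="b.html" rel="nofollow external">two/</a>'
      . '<a href="c.html" rel="external">three/</a>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Pad the normalized rel value with spaces so each token is matched whole.
$rel   = "concat(' ', normalize-space(@rel), ' ')";
$query = "//a[contains($rel, ' external ') and contains($rel, ' nofollow ')]";

$hrefs = [];
foreach ($xpath->query($query) as $node) {
    $hrefs[] = $node->getAttribute('href');
}

print_r($hrefs);
```

With this sample input, a.html and b.html are collected and c.html is skipped because it lacks the nofollow token.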

Upvotes: 3

Jerry

Reputation: 71538

First, your regex needs proper delimiters around it; a suitable one here is ~:

$regexp = '~<a (?![^>]*?rel="external nofollow")[^>]*?href="(.*?)"~';
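
To illustrate where the warning comes from: PHP takes the first character of the pattern as the delimiter, and `<` is paired with `>` as a bracket-style delimiter. The pattern therefore ends at the first `>` (the one inside `[^>]`), and the `]` that follows is read as a modifier. A minimal sketch, with a made-up `$html` sample and the warning suppressed with `@` only to keep the output clean:

```php
<?php
// Made-up sample: a link without rel="external nofollow".
$html = '<a href="text.html" rel="nofollow">plain</a>';

// No explicit delimiters: '<' ... '>' is taken as the delimiter pair, so the
// pattern is cut off at the first '>' and fails to compile.
$broken = @preg_match('<a (?![^>]*?rel="external nofollow")[^>]*?href="(.*?)"', $html);

// The same pattern wrapped in '~' delimiters compiles and runs.
$fixed = preg_match('~<a (?![^>]*?rel="external nofollow")[^>]*?href="(.*?)"~', $html, $m);

var_dump($broken); // false, because compilation failed
var_dump($fixed);  // 1
echo $m[1], "\n";  // text.html
```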

Second, this regex matches an anchor tag and captures the link in href only if there is no rel="external nofollow" in the tag, which I think is the opposite of what you're trying to do: negative lookaheads prevent matches. You might want to change the regex completely, to something like this instead:

$regexp = '~<a[^>]*?rel="external nofollow"[^>]*>(.*?)</a>~';

regex101 demo
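
A possible way to use the second pattern is sketched below. It assumes the trailing slash should be dropped, as in the question's desired result; the second link and the `$results` array are made up for illustration.

```php
<?php
$html = '<a href="text.html" rel="external nofollow">text1:text2:text3/</a>' . "\n"
      . '<a href="other.html" rel="nofollow">skipped/</a>';

$regexp = '~<a[^>]*?rel="external nofollow"[^>]*>(.*?)</a>~';

$results = [];
if (preg_match_all($regexp, $html, $matches)) {
    // $matches[1] holds the text captured between <a ...> and </a>.
    foreach ($matches[1] as $text) {
        // Assumption: the trailing "/" is not wanted in the result.
        $results[] = rtrim($text, '/');
    }
}

print_r($results);
```

Only the first link matches here, so `$results` should contain just text1:text2:text3.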

Upvotes: 0

Carsten Massmann

Reputation: 28196

Try

preg_match('/<a\b[^>]*rel="external nofollow"[^>]*>([^<]*)<\/a>/i',
           $string_to_search_through, $res);
echo $res[1];

$res[1] will give you the desired text.

Upvotes: 1
