Maor Barazany

Reputation: 761

Regex match full hyperlink only with certain class

I have a string that contains several hyperlinks, and I want to match only one particular link with a regex. I can't know whether the href or the class attribute comes first; the order may vary. Here is an example string:

<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>     
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>ccccc<a href='http://stv.localhost/channel/political/page/4' class='page'>4</a><a href='http://stv.localhost/channel/political/page/5' class='page'>5</a><a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a><span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>

From the above string I want to select only the link that has the class nextpostslink. So the match in this example should return this -

<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a>

This regex is the closest I could get -

/<a\s?(href=)?('|")(.*)('|") class=('|")nextpostslink('|")>.{1,6}<\/a>/

But its match begins at earlier links in the string instead of at the one I need. I think my problem is the (.*), but I can't figure out how to change it so that only the needed link is selected.

I would appreciate your help.

Upvotes: 1

Views: 3568

Answers (5)

Maxime Culea

Reputation: 109

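Since the question asks for a regex, here is one: <a\s[^>]*class=["']nextpostslink["'][^>]*>(.*?)<\/a>

The order of the attributes doesn't matter, and it accepts both single and double quotes. The character class is ["'] rather than ["|'] because a | inside square brackets is a literal pipe, and (.*?) is lazy so the match stops at the first closing </a>.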

Check the regex online: https://regex101.com/r/DX03KD/1/
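
As a rough illustration, a pattern of this kind could be applied in PHP as sketched below. The markup is a shortened, single-line copy of the sample from the question, and the pattern is written with ["'] and a lazy (.*?) so that the match ends at the first closing tag; the variable names are only for this sketch.

```php
<?php
// Shortened copy of the question's sample markup (assumption: one line is enough to show the idea).
$html = <<<'EOD'
<a href='http://stv.localhost/channel/political/page/5' class='page'>5</a><a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a><a href='http://stv.localhost/channel/political/page/8' class='last'>last</a>
EOD;

// [^>]* on both sides of class= lets href appear before or after the class attribute.
// (.*?) is lazy, so the match stops at the first </a> after the opening tag.
$pattern = '~<a\s[^>]*class=["\']nextpostslink["\'][^>]*>(.*?)</a>~';

$link = null;
$text = null;
if (preg_match($pattern, $html, $m)) {
    $link = $m[0]; // the full <a ...>...</a> element
    $text = $m[1]; // only the link text
}

echo $link, "\n";
```

Because [^>]* cannot cross a closing angle bracket, the match cannot start inside an earlier link and run into the wanted one.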

Upvotes: 0

Nicklas A.

Reputation: 7061

This would work in php:

/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m

This is of course assuming that the class attribute always comes after the href attribute.

This is a code snippet:

$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>     
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>ccccc<a href='http://stv.localhost/channel/political/page/4' class='page'>4</a><a href='http://stv.localhost/channel/political/page/5' class='page'>5</a><a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a><span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;

$regexp = "/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m";

$matches = array();
if(preg_match($regexp, $html, $matches)) {
    echo "URL: " . $matches[2] . "\n";
    echo "Text: " . $matches[6] . "\n";
}

I would however suggest first matching the link and then getting the url so that the order of the attributes doesn't matter:

<?php

$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>     
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>ccccc<a href='http://stv.localhost/channel/political/page/4' class='page'>4</a><a href='http://stv.localhost/channel/political/page/5' class='page'>5</a><a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a><span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;

$regexp = "/(<a[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>)/m";

$matches = array();
if(preg_match($regexp, $html, $matches)) {
    $link = $matches[0];
    $text = $matches[4];

    $regexp = "/href=(\"|')([^'\"]*)(\"|')/";
    $matches = array();
    // Search only inside the matched link, not the whole document
    if(preg_match($regexp, $link, $matches)) {
        $url = $matches[2];

        echo "URL: $url\n";
        echo "Text: $text\n";
    }
}

You could of course extend the regexp to match both variants (class first vs. href first), but it would be very long and I don't think it would improve performance.

Just as a proof of concept I created a regexp that doesn't care about the order:

/<a[^>]+(href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')|class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')[^>]+href=(\"|')([^\"']*)('|\"))[^>]*>(.{1,6})<\/a>/m

The text will be in group 12 and the URL will be in either group 3 or group 10 depending on the order.
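
To make the group bookkeeping concrete, here is a small sketch of how the URL could be picked out of group 3 or group 10. The function name extractNextLink and the two one-line test strings are made up for this illustration; the regexp itself is the one above, unchanged.

```php
<?php
// The order-independent pattern from above, unchanged.
$regexp = "/<a[^>]+(href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')|class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')[^>]+href=(\"|')([^\"']*)('|\"))[^>]*>(.{1,6})<\/a>/m";

// extractNextLink() is a hypothetical helper name used only for this sketch.
function extractNextLink(string $regexp, string $html): ?array
{
    if (!preg_match($regexp, $html, $m)) {
        return null;
    }
    // Groups that did not take part in the match come back as empty strings,
    // so whichever of group 3 / group 10 is non-empty holds the URL.
    $url = $m[3] !== '' ? $m[3] : $m[10];

    return ['url' => $url, 'text' => $m[12]];
}

// Hypothetical inputs covering both attribute orders.
$hrefFirst  = extractNextLink($regexp, '<a href="/page/2" class="nextpostslink">Next</a>');
$classFirst = extractNextLink($regexp, "<a class='nextpostslink' href='/page/2'>Next</a>");

print_r($hrefFirst);
print_r($classFirst);
```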

Upvotes: 0

Lucas de Oliveira

Reputation: 1632

Not sure if that's what you're after, but anyway: it's a bad idea to parse HTML with regex. Use an XPath implementation to reach the desired elements. The following XPath expression would give you all the 'a' elements whose class contains "nextpostslink":

//a[contains(@class,"nextpostslink")]

There is plenty of XPath information around. Since you didn't mention your programming language, here is a quick XPath tutorial using Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html

Edit:

php + xpath + html: http://dev.juokaz.com/php/web-scraping-with-php-and-xpath
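
For PHP specifically, the expression above could be run with the built-in DOMDocument and DOMXPath classes, roughly as follows. This is only a sketch: the markup is a trimmed version of the sample from the question with plain ASCII link text, and the variable names are arbitrary.

```php
<?php
// Trimmed version of the question's markup (assumption: ASCII link text to keep encoding out of the picture).
$html = <<<'EOD'
<div class='wp-pagenavi'>
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>
<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">next</a>
<a class="cccc">xxx</a>
</div>
EOD;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[contains(@class,"nextpostslink")]');

// Gather the href of every matching link.
$hrefs = [];
foreach ($nodes as $node) {
    $hrefs[] = $node->getAttribute('href');
}

print_r($hrefs);
```

Note that contains() does a substring test, so it would also accept a class such as "nextpostslink-old". A stricter whole-token test is contains(concat(" ", normalize-space(@class), " "), " nextpostslink ").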

Upvotes: 0

lonesomeday

Reputation: 237845

It's much better to use a genuine HTML parser for this. Abandon all attempts to use regular expressions on HTML.

Use PHP's DOMDocument instead:

$dom = new DOMDocument;
$dom->loadHTML($yourHTML);

foreach ($dom->getElementsByTagName('a') as $link) {
    $classes = explode(' ', $link->getAttribute('class'));

    if (in_array('nextpostslink', $classes)) {
        // $link has the class "nextpostslink"
    }
}
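
To show how the loop above might be filled in, here is a self-contained sketch that collects the URL, the text and the serialized element of the first matching link. The sample markup is trimmed from the question and uses ASCII link text; the $found array and its keys are names chosen for this example.

```php
<?php
// Trimmed version of the question's markup (assumption for illustration).
$yourHTML = <<<'EOD'
<div class='wp-pagenavi'>
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>
<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">next</a>
<a class="cccc">xxx</a>
</div>
EOD;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($yourHTML);

$found = null;
foreach ($dom->getElementsByTagName('a') as $link) {
    // Split on any run of whitespace so "a  b" and "a\tb" are handled too.
    $classes = preg_split('/\s+/', $link->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);

    if (in_array('nextpostslink', $classes, true)) {
        $found = [
            'href' => $link->getAttribute('href'),
            'text' => $link->textContent,
            'html' => $dom->saveHTML($link), // the element serialized back to markup
        ];
        break; // the first match is enough here
    }
}

print_r($found);
```

Comparing whole class tokens means a link with class="page nextpostslink" is found, while one with class="nextpostslink-old" is not.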

Upvotes: 2

Eton B.

Reputation: 6281

I replaced the (.*) with [^'"]+ as follows:

<a\s*(href=)?('|")[^'"]+('|") class=('|")nextpostslink('|")>.{1,6}</a>

Note: I tried this with RegexBuddy, so I didn't need to escape the <, > or / characters.
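
The effect of that change can be seen in PHP with a small comparison like the one below. It is only a sketch: the input is two links from the question's sample joined on one line, both patterns are escaped for a single-quoted PHP string, and the variable names are arbitrary.

```php
<?php
// Two links from the question's sample on a single line (shortened for illustration).
$html = "<a href='http://stv.localhost/channel/political/page/5' class='page'>5</a>"
      . '<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a>';

// Original pattern from the question: (.*) is greedy and may run across earlier links.
$greedy = '/<a\s?(href=)?(\'|")(.*)(\'|") class=(\'|")nextpostslink(\'|")>.{1,6}<\/a>/';

// Pattern from this answer: [^\'"]+ cannot cross a quote, so it stays inside one href value.
$strict = '/<a\s*(href=)?(\'|")[^\'"]+(\'|") class=(\'|")nextpostslink(\'|")>.{1,6}<\/a>/';

preg_match($greedy, $html, $g);
preg_match($strict, $html, $s);

echo "greedy: ", $g[0], "\n";
echo "strict: ", $s[0], "\n";
```

With the greedy pattern the match starts at the first link; with the quote-excluding class it starts at the nextpostslink link. Like the original, this pattern still expects href to come before class.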

Upvotes: -1
