fractal5

Reputation: 2132

Get hrefs that match regex expression using PHP & XPath

I have a page that contains several hyperlinks. The ones I want to get are of the format:

<html>
<body>

<div id="diva">
<a href="/123" >text2</a>
</div>

<div id="divb">
<a href="/345" >text1</a>
<a href="/678" >text2</a>
</div>

</body>
</html>

I want to extract the three hrefs: 123, 345, and 678.

I know how to get all the hyperlinks using $gm = $xpath->query("//a") and then loop through them to get the href attribute.

Is there some sort of regexp to get only the attributes with the above format (i.e. "/digits")?

Thanks

Upvotes: 2

Views: 872

Answers (1)

har07

Reputation: 89285

XPath 1.0, the version supported by DOMXPath, has no regex functions. However, you can register a PHP function with DOMXPath and call it from the query to run a regular expression, as mentioned in this other answer.

There is also an XPath 1.0 way to test whether a value is a number. Applied to the part of the href attribute after the / character, it tests whether the attribute value follows the pattern /digits:

//a[number(substring-after(@href,'/')) = substring-after(@href,'/')]
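As a rough sketch of how that expression could be run from PHP: the sample markup is taken from the question with two extra non-matching links added, and the names $html, $query and $hrefs are assumptions for illustration. The sketch also adds a starts-with() check, because substring-after() on its own would accept a value such as foo/9. Note that number() accepts values like /1.5 or /-3 as well, so this test is looser than a strict digits-only regex.

```php
<?php
// Sample markup based on the question, plus two links that should NOT match.
$html = <<<'XML'
<html>
<body>
<div id="diva"><a href="/123">text2</a></div>
<div id="divb">
<a href="/345">text1</a>
<a href="/678">text2</a>
<a href="/about">not numeric</a>
<a href="foo/9">no leading slash</a>
</div>
</body>
</html>
XML;

$doc = new DOMDocument();
$doc->loadXML($html);
$xpath = new DOMXPath($doc);

// starts-with() anchors the leading slash. The number() comparison is
// false for non-numeric text because NaN never equals NaN in XPath 1.0.
$query = "//a[starts-with(@href, '/')"
       . " and number(substring-after(@href, '/')) = substring-after(@href, '/')]";

$hrefs = [];
foreach ($xpath->query($query) as $a) {
    $hrefs[] = $a->getAttribute('href');
}

print_r($hrefs);
```

With this markup, only the three /digits links from the question should end up in $hrefs.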

UPDATE:

For the sake of completeness, here is a working example of calling the PHP function preg_match from DOMXPath::query() to accomplish the same task:

$raw_data = <<<XML
<html>
<body>

<div id="diva">
<a href="/123" >text2</a>
</div>

<div id="divb">
<a href="/345" >text1</a>
<a href="/678" >text2</a>
</div>

</body>
</html>
XML;
$doc = new DOMDocument;
$doc->loadXML($raw_data);

$xpath = new DOMXPath($doc);

$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions("preg_match");

// php:function's parameters below are :
// parameter 1: PHP function name
// parameter 2: PHP function's 1st parameter, the pattern
// parameter 3: PHP function's 2nd parameter, the string
$gm = $xpath->query("//a[php:function('preg_match', '~^/\d+$~', string(@href))]");

foreach ($gm as $a) {
    echo $a->getAttribute("href") . "\n";
}
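If registering PHP functions with the XPath engine is not desirable, another option is to select every anchor that has an href and apply the regex in the PHP loop, which is close to what the question already describes. The following is only a sketch under a few assumptions: the markup is a shortened copy of the question's sample with one extra non-matching link, loadHTML() is used on the guess that real pages are HTML rather than well-formed XML, and $ids is an illustrative name. The capture group also strips the leading slash, since the question asks for the digits themselves.

```php
<?php
// Shortened sample markup based on the question, plus one non-matching link.
$html = '<html><body>'
      . '<div id="diva"><a href="/123">text2</a></div>'
      . '<div id="divb"><a href="/345">text1</a><a href="/678">text2</a>'
      . '<a href="/about">other</a></div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$ids = [];
foreach ($xpath->query('//a[@href]') as $a) {
    // Capture the digits after the leading slash; \z anchors at the very
    // end of the string, so a trailing newline is not accepted.
    if (preg_match('~^/(\d+)\z~', $a->getAttribute('href'), $m)) {
        $ids[] = $m[1];
    }
}

echo implode(',', $ids), "\n";
```

This keeps the XPath query trivial and leaves the pattern matching to PHP, at the cost of iterating over every link on the page.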

Upvotes: 3
