Reputation: 323
I am trying to extract specific type of links from the webpage using php
links are like following..
http://www.example.com/pages/12345667/some-texts-available-here
I want to extract all links like in the above format.
maindomain.com/pages/somenumbers/sometexts
So far I can extract all the links from the webpage, but the above filter is not happening. How can i acheive this ?
Any suggestions ?
<?php
$html = file_get_contents('http://www.example.com');
//Create a new DOM document
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
?>
Upvotes: 1
Views: 130
Reputation: 43169
You already use a parser, so you might step forward and use an xpath query on the DOM. XPath queries offer functions like starts-with()
as well, so this might work:
$xpath = new DOMXpath($dom);
$links = $xpath->query("//a[starts-with(@href, 'maindomain.com')]");
Loop over them afterwards:
foreach ($links as $link) {
// do sth. with it here
// after all, it is a DOMElement
}
Upvotes: 0
Reputation: 89547
You can use DOMXPath and register a function with DOMXPath::registerPhpFunctions
to use it after in an XPATH query:
function checkURL($url) {
$parts = parse_url($url);
unset($parts['scheme']);
if ( count($parts) == 2 &&
isset($parts['host']) &&
isset($parts['path']) &&
preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) ) {
return true;
}
return false;
}
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($filename);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('checkURL');
$links = $xp->query("//a[php:functionString('checkURL', @href)]");
foreach ($links as $link) {
echo $link->getAttribute('href'), PHP_EOL;
}
In this way you extract only the links you want.
Upvotes: 2
Reputation: 23958
This is a slight guess, but if I got it wrong you can still see the way to do it.
foreach ($links as $link){
//Extract and show the "href" attribute.
If(preg_match("/(?:http.*)maindomain\.com\/pages\/\d+\/.*/",$link->getAttribute('href')){
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
}
Upvotes: 0