Durgaprasad

Reputation: 323

How to extract a specific type of link from a website using PHP?

I am trying to extract a specific type of link from a web page using PHP.

The links look like this:

http://www.example.com/pages/12345667/some-texts-available-here

I want to extract all links that follow the format above:

maindomain.com/pages/somenumbers/sometexts

So far I can extract all the links from the web page, but I have not managed to apply the filter described above. How can I achieve this?

Any suggestions?

<?php
$html = file_get_contents('http://www.example.com');

// Create a new DOM document
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');

// Iterate over the extracted links and display their URLs
foreach ($links as $link) {
    // Extract and show the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}
?>

Upvotes: 1

Views: 130

Answers (3)

Jan

Reputation: 43169

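You are already using a parser, so you can go a step further and run an XPath query on the DOM. XPath provides functions such as starts-with(), so this might work:

$xpath = new DOMXPath($dom);
// starts-with() compares from the first character of the href, so for absolute
// URLs the prefix must include the scheme, e.g. 'http://maindomain.com/pages/'.
$links = $xpath->query("//a[starts-with(@href, 'maindomain.com')]");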

Loop over them afterwards:

foreach ($links as $link) {
    // do something with it here;
    // each $link is a DOMElement
}
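
Since XPath 1.0 (which DOMXPath implements) has no regular expressions, the "digits only" part of the URL can be approximated with translate(). The following is a minimal sketch under stated assumptions, not a definitive implementation: the sample markup, the host name maindomain.com, and the helper name extractPageLinks are all made up for illustration.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = <<<HTML
<html><body>
<a href="http://www.maindomain.com/pages/12345667/some-texts-available-here">Match</a>
<a href="http://www.maindomain.com/pages/abc/not-a-number">No digits</a>
<a href="http://www.maindomain.com/about">About</a>
</body></html>
HTML;

// Hypothetical helper: returns the href values that look like
// maindomain.com/pages/<digits>/<something>.
function extractPageLinks(string $html): array
{
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);

    // 1st predicate: the href mentions the wanted host and path prefix.
    // 2nd predicate: there is a non-empty segment between "/pages/" and the next "/".
    // 3rd predicate: removing all digits from that segment leaves nothing.
    $query = "//a[contains(@href, 'maindomain.com/pages/')]"
           . "[string-length(substring-before(substring-after(@href, '/pages/'), '/')) > 0]"
           . "[translate(substring-before(substring-after(@href, '/pages/'), '/'), '0123456789', '') = '']";

    $result = [];
    foreach ($xpath->query($query) as $node) {
        $result[] = $node->getAttribute('href');
    }
    return $result;
}

print_r(extractPageLinks($html));
```

With the sample markup above, only the first link should survive all three predicates. The predicates do not check what follows the numeric segment, so a stricter filter would still need a check in PHP.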

Upvotes: 0

Casimir et Hippolyte

Reputation: 89547

You can use DOMXPath and register a PHP function with DOMXPath::registerPhpFunctions so that it can be called from within an XPath query:

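function checkURL($url) {
    $parts = parse_url($url);
    // parse_url() returns false for seriously malformed URLs
    if ($parts === false) {
        return false;
    }
    // drop the scheme so that only host and path should remain
    unset($parts['scheme']);

    // accept only URLs made of exactly a host and a /pages/<digits>/<text> path
    if ( count($parts) == 2    &&
         isset($parts['host']) &&
         isset($parts['path']) &&
         preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) ) {
        return true;
    }
    return false;
}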

libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTMLFile($filename);

$xp = new DOMXPath($dom);

$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('checkURL');

$links = $xp->query("//a[php:functionString('checkURL', @href)]");

foreach ($links as $link) {
    echo $link->getAttribute('href'), PHP_EOL;
}

In this way you extract only the links you want.
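
To see why the count($parts) == 2 test in checkURL is enough to reject URLs with a query string, a fragment, or no host, it may help to look at which components parse_url() reports. This is only an illustrative sketch; the sample URLs are made up.

```php
<?php
// Hypothetical sample URLs, chosen only for illustration.
$samples = [
    'http://www.example.com/pages/12345667/some-texts-available-here',
    'http://www.example.com/pages/12345667/some-texts?ref=home',
    '/pages/12345667/some-texts',
];

foreach ($samples as $url) {
    // parse_url() returns an array containing only the components that are present.
    $parts = parse_url($url);
    echo $url, ' => ', implode(', ', array_keys($parts)), PHP_EOL;
}
```

The first URL yields scheme, host and path, so exactly two components remain once the scheme is removed. The second also has a query component, and the third has only a path, so neither reaches a count of two.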

Upvotes: 2

Andreas

Reputation: 23958

This is partly a guess, but even if I got the pattern wrong, it shows the way to do it.

foreach ($links as $link) {
    // Extract and show the "href" attribute only when it matches the pattern.
    if (preg_match("/(?:http.*)maindomain\.com\/pages\/\d+\/.*/", $link->getAttribute('href'))) {
        echo $link->nodeValue;
        echo $link->getAttribute('href'), '<br>';
    }
}
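
If the href values have already been collected into an array, the same kind of pattern can be applied in one call with preg_grep(). The sketch below is an assumption-laden variant rather than the answer's exact pattern: it uses "~" as the delimiter to avoid escaping slashes, anchors the pattern at both ends, and works on made-up sample URLs.

```php
<?php
// Hypothetical href values, as they might be collected via $link->getAttribute('href').
$hrefs = [
    'http://www.maindomain.com/pages/12345667/some-texts-available-here',
    'http://www.maindomain.com/pages/abc/not-a-number',
    'http://www.maindomain.com/contact',
    'https://maindomain.com/pages/42/another-title',
];

// "~" as delimiter avoids escaping every "/"; ^ and $ force the whole URL to match.
$pattern = '~^https?://(?:www\.)?maindomain\.com/pages/\d+/[^/]+$~';

// preg_grep() keeps the original array keys, so reindex with array_values().
$matches = array_values(preg_grep($pattern, $hrefs));

print_r($matches);
```

Anchoring is a design choice: it rejects URLs with extra path segments, which the unanchored pattern would accept.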

Upvotes: 0
