Doktor13

Reputation: 41

XPath Query & HTML - Find Specific HREF's Within Anchor Tags

I've got the HTML data required in a DOMDocument and DOMXPath.

But I need to access and retrieve the href values in certain <a> tags. The criteria are:

  1. href contains: some-site.vendor.com/jobs/[#idnumber]/job (e.g. some-site.vendor.com/jobs/23094/job)

  2. href does not contain: some-site.vendor.com/jobs/search?search=pr2

  3. href does not contain: some-site.vendor.com/jobs/intro

  4. href does not contain: www.someothersite.com/

  5. href does not contain: media.someothersite.com/

  6. href does not contain: javascript:void(0)

Either of these (similar) queries fetches everything but 4-6 - that's a good thing:

$joblinks = $xpath->query('//a[@href[contains(., "https://some-site.vendor.com/jobs/")]]');    
$joblinks = $xpath->query('//a[@href[contains(., "job")]]');

Ultimately, however, I need to access all the anchor tags containing hrefs like #1 and assign the actual href values to a variable/array. Here's what I'm doing:

$payload = fetchRemoteData(SPEC_SOURCE_URL);

// suppress warning(s) due to malformed markup
libxml_use_internal_errors(true);

// load the fetched contents
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($payload);

// parse and cache the required data elements
$xpath = new DOMXPath($dom);

//$joblinks = $xpath->query('//a[@href[contains(., "some-site.vendor.com/jobs/")]]');
$joblinks = $xpath->query('//a[@href[contains(., "job")]]');
foreach($joblinks as $joblink) {
    var_dump(trim($joblink->nodeValue)); // dump hrefs here!
}
echo "\n";

This is really beating me up - I'm close but I just can't seem to tweak the query correctly and/or access the actual href values. My humblest apologies if I've not followed protocol of any sorts for this question...

ANY/ALL help would be greatly appreciated! Thanx SO MUCH in advance!

Upvotes: 1

Views: 2418

Answers (1)

hakre

Reputation: 197659

I would not suggest doing this solely with XPath. First of all, you have both a whitelist and a blacklist. It's not really clear what you want, so I assume the criteria can change over time.

So what you can do is first select all href attributes in question and return the nodes. That's what XPath is very good at, so let's use XPath:

if (!$links = $xpath->query('//a/@href')) {
    throw new Exception('XPath query failed.');
}

You now have the common DOMNodeList in $links, and it contains zero or more DOMAttr nodes because those are what we selected. These now need the filtering you're looking for.

So you have some criteria you want to match. Your description is verbose but not very specific about how that should work. You have a positive match but also negative matches, and in both cases you don't say what should happen when a link fails to match. So I take a shortcut here: you write yourself a function that returns true or false depending on whether an "href" string matches the criteria:
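To make the step above concrete, here is a minimal sketch of reading the href strings out of the DOMAttr nodes. The markup is made up for illustration. Note that `nodeValue` on an `<a>` element gives the link text, not the href; on a DOMAttr, `value` (or `nodeValue`) gives the attribute text, and on an element you could use `getAttribute('href')` instead.

```php
<?php
// Made-up markup for illustration only.
$html = '<html><body>'
      . '<a href="https://some-site.vendor.com/jobs/23094/job">Job</a>'
      . '<a href="javascript:void(0)">Noop</a>'
      . '<a name="no-href">Anchor without href</a>'
      . '</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the href attribute nodes themselves, not the <a> elements.
$links = $xpath->query('//a/@href');

$hrefs = array();
foreach ($links as $attr) {
    // $attr is a DOMAttr; ->value holds the attribute text
    $hrefs[] = trim($attr->value);
}

var_dump($hrefs);
```

Anchors without an href attribute simply do not appear in the list, so no extra check is needed for them.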

function is_valid_href($href) {

    // do whatever you see fit ...

    return true; // or false, depending on your criteria
}

So the problem of telling whether an href is valid has now been solved. Best thing: you can change it later.
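As one possible reading of the criteria in the question, the validator might look like the sketch below. The regular expression and the blacklist entries are assumptions derived from the question's list, not a definitive rule set, and the function assumes it receives an absolute href string (or null).

```php
<?php
/**
 * Hypothetical validator: whitelist pattern first, then blacklist substrings.
 *
 * @param string|null $href
 * @return bool
 */
function is_valid_href($href) {
    if (!is_string($href) || $href === '') {
        return false;
    }

    // whitelist (criterion 1): some-site.vendor.com/jobs/<digits>/job
    if (!preg_match('~some-site\.vendor\.com/jobs/\d+/job~', $href)) {
        return false;
    }

    // blacklist (criteria 2-6): substrings that must not occur
    $blacklist = array(
        'some-site.vendor.com/jobs/search?search=pr2',
        'some-site.vendor.com/jobs/intro',
        'www.someothersite.com/',
        'media.someothersite.com/',
        'javascript:void(0)',
    );
    foreach ($blacklist as $needle) {
        if (strpos($href, $needle) !== false) {
            return false;
        }
    }

    return true;
}

var_dump(is_valid_href('https://some-site.vendor.com/jobs/23094/job'));
var_dump(is_valid_href('https://some-site.vendor.com/jobs/intro'));
var_dump(is_valid_href('javascript:void(0)'));
```

With a whitelist this strict the blacklist is mostly redundant, but keeping both as plain data makes it easy to loosen one without touching the other.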

So all that's needed to integrate that with the links is to get every link in its normalized, absolute form. This means more data processing, because there are several different types of URL normalization.

So we create another function that encapsulates href normalization, base resolution and validation. If the href is wrong, it just returns null; otherwise it returns the normalized href:

function normalize_href($href, $base) {

    // do whatever is needed ...

    return $href; // the normalized href string, or null if the href is unusable
}
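For illustration, a deliberately simplified version of such a function could look like this. It uses only PHP built-ins rather than a URL library, does not implement full RFC 3986 reference resolution (dot segments such as `../` are left untouched), and its policy of rejecting non-http(s) schemes is an assumption.

```php
<?php
/**
 * Simplified href normalization (illustrative only).
 *
 * @param string $href raw attribute value
 * @param string $base absolute URL of the document the href came from
 * @return string|null absolute http(s) URL, or null if the href is unusable
 */
function normalize_href($href, $base) {
    $href = trim((string) $href);
    if ($href === '' || $href[0] === '#') {
        return null; // empty or fragment-only: nothing to follow
    }

    // href with its own scheme: keep http(s), reject javascript:, mailto:, ...
    if (preg_match('~^([a-z][a-z0-9+.-]*):~i', $href, $m)) {
        return in_array(strtolower($m[1]), array('http', 'https'), true) ? $href : null;
    }

    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // base is not an absolute URL
    }
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path = isset($parts['path']) ? $parts['path'] : '/';

    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href; // scheme-relative
    }
    if ($href[0] === '/') {
        return $origin . $href;                // root-relative
    }
    if ($href[0] === '?') {
        return $origin . $path . $href;        // query-only
    }

    // path-relative: append to the directory part of the base path
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

$base = 'https://some-site.vendor.com/jobs/search?search=pr2';
var_dump(normalize_href('/jobs/23094/job', $base));
var_dump(normalize_href('javascript:void(0)', $base));
```

A dedicated URL library handles the edge cases this sketch skips, which is a good reason to prefer one for real data.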

Let's put this together. In my case I even make the href a Net_URL2 instance so the validator can benefit from it.

Naturally, if you wrap this up into closures or some classes, it gets a nicer interface. You could also consider making the XPath expression a parameter as well:

// get all href
if (!$links = $xpath->query('//a/@href')) {
    throw new Exception('XPath query failed.');
}

// set a base URL
$base = 'https://stackoverflow.com/questions/9894956/xpath-query-html-find-specific-hrefs-within-anchor-tags';

/**
 * @return bool
 */
function is_valid_href($href) {    
    ...
}

/**
 * @return href
 */
function normalize_href($href, $base) {
    ...
}

$joblinks = array();
foreach ($links as $attr) {
    $href = normalize_href($attr->nodeValue, $base);
    if (is_valid_href($href)) {
        $joblinks[] = $href;
    }
}

// your result is in:
var_dump($joblinks);

I've run an example on this website, and the result is:

array(122) {
  [0]=>
  object(Net_URL2)#129 (8) {
    ["_options":"Net_URL2":private]=>
    array(5) {
      ["strict"]=>
      bool(true)
      ["use_brackets"]=>
      bool(true)
      ["encode_keys"]=>
      bool(true)
      ["input_separator"]=>
      string(1) "&"
      ["output_separator"]=>
      string(1) "&"
    }
    ["_scheme":"Net_URL2":private]=>
    string(4) "http"
    ["_userinfo":"Net_URL2":private]=>
    bool(false)
    ["_host":"Net_URL2":private]=>
    string(17) "stackexchange.com"
    ["_port":"Net_URL2":private]=>
    bool(false)
    ["_path":"Net_URL2":private]=>
    string(1) "/"
    ["_query":"Net_URL2":private]=>
    bool(false)
    ["_fragment":"Net_URL2":private]=>
    bool(false)
  }
  [1]=> 

  ...

  [121]=>
  object(Net_URL2)#250 (8) {
    ["_options":"Net_URL2":private]=>
    array(5) {
      ["strict"]=>
      bool(true)
      ["use_brackets"]=>
      bool(true)
      ["encode_keys"]=>
      bool(true)
      ["input_separator"]=>
      string(1) "&"
      ["output_separator"]=>
      string(1) "&"
    }
    ["_scheme":"Net_URL2":private]=>
    string(4) "http"
    ["_userinfo":"Net_URL2":private]=>
    bool(false)
    ["_host":"Net_URL2":private]=>
    string(22) "blog.stackoverflow.com"
    ["_port":"Net_URL2":private]=>
    bool(false)
    ["_path":"Net_URL2":private]=>
    string(30) "/2009/06/attribution-required/"
    ["_query":"Net_URL2":private]=>
    bool(false)
    ["_fragment":"Net_URL2":private]=>
    bool(false)
  }
}
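The remark above about wrapping this up with a nicer interface, and passing the XPath expression as a parameter, might be sketched as follows. The function name `collect_hrefs`, the markup, and both callbacks are hypothetical and kept trivial on purpose; a real normalizer and validator would be plugged in instead.

```php
<?php
/**
 * Hypothetical wrapper: select attribute nodes with an XPath expression,
 * pass each value through $normalize, and keep those $validate accepts.
 *
 * @return array list of normalized hrefs
 */
function collect_hrefs(DOMXPath $xpath, $expression, callable $normalize, callable $validate) {
    $nodes = $xpath->query($expression);
    if ($nodes === false) {
        throw new Exception('XPath query failed.');
    }

    $result = array();
    foreach ($nodes as $node) {
        $href = $normalize($node->nodeValue);
        if ($href !== null && $validate($href)) {
            $result[] = $href;
        }
    }
    return $result;
}

// Usage with made-up markup and deliberately trivial callbacks.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<a href="/jobs/1/job">A</a><a href="/jobs/intro">B</a>');
$xpath = new DOMXPath($dom);

$base = 'https://some-site.vendor.com';
$joblinks = collect_hrefs(
    $xpath,
    '//a/@href',
    function ($href) use ($base) { return $base . trim($href); },
    function ($href) { return (bool) preg_match('~/jobs/\d+/job$~', $href); }
);

var_dump($joblinks);
```

Passing the two steps as callables keeps the loop itself unchanged when the criteria change.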

Upvotes: 1
