Cesare

Reputation: 1749

Get the right XPath for HTML elements

I need to scrape this HTML page ...

http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3

... using PHP and XPath to get values such as the 0 under the string "CODICE BIANCO".

(NOTE: you may see different values on that page if you browse it. That doesn't matter, since they change dynamically.)

I'm using this PHP code sample to print the value ...

<?php
    ini_set('display_errors', 'On');
    error_reporting(E_ALL);

    include "./tmp/vendor/autoload.php";

    $url = 'http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3';

    //$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';

    $xpath_for_parsing = '//*[@id="contentint"]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';

    //#Set CURL parameters: pay attention to the PROXY config !!!!
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_PROXY, '');
    $data = curl_exec($ch);
    curl_close($ch);

    $dom = new DOMDocument();
    @$dom->loadHTML($data);

    $xpath = new DOMXPath($dom);
    $colorWaitingNumber = $xpath->query($xpath_for_parsing);
    $theValue =  'N.D.';
    foreach( $colorWaitingNumber as $node )
    {
      $theValue = $node->nodeValue;
    }

    print $theValue;

?>

I've extracted the XPath using both the Chrome and Firefox web consoles ...

Suggestions / examples?

Upvotes: 0

Views: 54

Answers (2)

Nigel Ren

Reputation: 57121

Rather than relying on what is potentially quite a fragile hierarchy (which we all find ourselves building at times), it may be worth anchoring on something relatively near the data you're looking for. I've only changed the XPath: it starts from the text "CODICE BIANCO" and finds the data relative to that string.

$xpath_for_parsing = '//*[text()="CODICE BIANCO"]/../../following-sibling::tr[1]//descendant::b[2]';

This can still break when the coders change the page format, but it keeps the expression as local to the data as possible.
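To see how the anchor-based expression behaves, here is a minimal sketch run against a small inline fragment. The fragment is hypothetical (the real page's markup is only assumed to have the label in one row and the counters in the next), and `findValueNearLabel` is an illustrative helper name, not part of any library.

```php
<?php
// Hypothetical fragment: the label sits in one row, the counters in the next.
// The real page is assumed, not known, to follow this shape.
$html = <<<HTML
<table>
  <tr><td><b>CODICE BIANCO</b></td></tr>
  <tr><td><b>In attesa</b></td><td><b>0</b></td></tr>
</table>
HTML;

// Illustrative helper: returns the text of the first matching node, or null.
function findValueNearLabel(string $html, string $expression): ?string
{
    $dom = new DOMDocument();
    // Collect parser warnings quietly instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

// Start at the label, climb to its row, move to the next row, take the second <b>.
$expression = '//*[text()="CODICE BIANCO"]/../../following-sibling::tr[1]//descendant::b[2]';
$value = findValueNearLabel($html, $expression);

print $value ?? 'N.D.';
```

If the label text on the live page carries extra whitespace, `text()="CODICE BIANCO"` will not match; `normalize-space(text())="CODICE BIANCO"` may be a more forgiving test in that case.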

Upvotes: 1

Matey

Reputation: 1210

Both Chrome and Firefox most probably normalise the original HTML by adding <tbody> elements inside <table>, because the original HTML does not contain them. cURL only fetches the raw markup, and DOMDocument::loadHTML() does not add them either, which is why your XPath fails. Try this one instead:

$xpath_for_parsing = '//*[@id="contentint"]/table[2]/tr[1]/td/table/tr[3]/td[1]/table/tr[11]/td[3]/b';
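The difference can be checked with a small sketch. The fragment below is a hypothetical, much shortened stand-in for the real page (only the `contentint` id is taken from the question), and `countMatches` is an illustrative helper name. It runs the same path with and without the `tbody` step against markup that has no `<tbody>` in the source.

```php
<?php
// Hypothetical, simplified markup with no <tbody> in the source,
// as is assumed to be the case for the real page.
$html = '<div id="contentint"><table><tr><td><b>0</b></td></tr></table></div>';

// Illustrative helper: number of nodes an XPath expression matches in $html.
function countMatches(string $html, string $expression): int
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($dom))->query($expression);
    return $nodes === false ? 0 : $nodes->length;
}

// Path as copied from a browser console, which shows the browser's own DOM.
$withTbody = '//*[@id="contentint"]/table/tbody/tr/td/b';
// Same path against the tree that DOMDocument actually builds.
$withoutTbody = '//*[@id="contentint"]/table/tr/td/b';

printf("with tbody: %d\n", countMatches($html, $withTbody));
printf("without tbody: %d\n", countMatches($html, $withoutTbody));
```

Dumping the parsed tree with `$dom->saveHTML()` is a quick way to confirm what structure the expression really has to match.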

Upvotes: 1
