Reputation: 1749
I need to scrape this HTML page:
http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3
using PHP and XPath, to get values such as the 0 shown under the string "CODICE BIANCO".
(Note: you may see different values if you browse the page. That doesn't matter, because they change dynamically.)
I'm using this PHP code sample to print the value:
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
include "./tmp/vendor/autoload.php";
$url = 'http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3';
//$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';
$xpath_for_parsing = '//*[@id="contentint"]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach ($colorWaitingNumber as $node) {
    $theValue = $node->nodeValue;
}
print $theValue;
?>
I extracted the XPath using both the Chrome and Firefox developer consoles, but the query does not return the value.
Any suggestions or examples?
Upvotes: 0
Views: 54
Reputation: 57121
Rather than relying on what is potentially quite a fragile hierarchy (which we all find ourselves building at times), it may be worth anchoring on something close to the data you're looking for. I've only rewritten the XPath expression: it starts from the text "CODICE BIANCO" and locates the value relative to that string.
$xpath_for_parsing = '//*[text()="CODICE BIANCO"]/../../following-sibling::tr[1]//descendant::b[2]';
This can still break if the site's developers change the page layout, but it keeps the dependency on the page structure as local as possible.
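To illustrate the idea of anchoring on a label, here is a minimal sketch run against made-up markup rather than the real page. The HTML string, the helper name findValueBelowLabel and the row layout (a label row followed directly by a value row) are all assumptions for the example, so the expression will probably need adjusting for the actual page.

```php
<?php
// Hypothetical markup, NOT the real page: each label row is followed by a value row.
$html = '<html><body><table>'
      . '<tr><td><b>CODICE ROSSO</b></td></tr>'
      . '<tr><td><b>2</b></td></tr>'
      . '<tr><td><b> CODICE BIANCO </b></td></tr>'
      . '<tr><td><b>0</b></td></tr>'
      . '</table></body></html>';

/**
 * Hypothetical helper: returns the text of the first <b> in the row that
 * follows the row containing $label, or null when nothing matches.
 * Assumes $label contains no double quotes.
 */
function findValueBelowLabel(DOMXPath $xpath, string $label): ?string
{
    // normalize-space() tolerates stray whitespace around the label text.
    $query = sprintf(
        '//b[normalize-space(.)="%s"]/ancestor::tr[1]/following-sibling::tr[1]//b',
        $label
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of using @
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$white   = findValueBelowLabel($xpath, 'CODICE BIANCO');
$missing = findValueBelowLabel($xpath, 'CODICE VERDE'); // label absent from the sample

print ($white ?? 'N.D.') . "\n";
print ($missing ?? 'N.D.') . "\n";
```

Returning null for a missing label lets the caller fall back to a placeholder such as 'N.D.', as the question's code does.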
Upvotes: 1
Reputation: 1210
Both Chrome and Firefox most likely normalize the original HTML by adding <tbody> elements inside <table>, because the raw markup does not contain them. Neither cURL nor DOMDocument does this, which is why your XPath fails. Try this one instead:
$xpath_for_parsing = '//*[@id="contentint"]/table[2]/tr[1]/td/table/tr[3]/td[1]/table/tr[11]/td[3]/b';
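You can check the <tbody> difference without touching the live page. The sketch below uses a made-up, much simplified snippet (the markup and the variable names are assumptions, not the real page) and runs the same path with and without tbody against what DOMDocument actually parsed.

```php
<?php
// Hypothetical, simplified markup: like the raw page source, it has no <tbody>.
$html = '<html><body><div id="contentint">'
      . '<table><tr><td><b>7</b></td></tr></table>'
      . '</div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of using @
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Path in the style copied from a browser console, including the inserted <tbody>.
$withTbody = $xpath->query('//*[@id="contentint"]/table/tbody/tr/td/b');

// Same path without <tbody>, matching the tree DOMDocument built from the raw markup.
$withoutTbody = $xpath->query('//*[@id="contentint"]/table/tr/td/b');

print 'with tbody: ' . $withTbody->length . " node(s)\n";
print 'without tbody: ' . $withoutTbody->length . " node(s)\n";
```

Comparing the two node counts on your own saved copy of the page is a quick way to tell whether a browser-generated path needs its tbody steps removed.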
Upvotes: 1