Reputation: 21
I have been playing around with cURL and XPath for some web scraping. I finally got my code running the way I want, but after I tried it on another site it stopped working. The only things I changed are the XPath and the URL. I'm totally new to this and have only been working with it for a week, so bear with me if it's an obvious mistake.
My code is:
<?php
/*----Connection to Database----*/
include('wp-config.php');
mysql_connect(DB_HOST, DB_USER, DB_PASSWORD);
mysql_select_db("db");
/*----US Dollar Index----*/
$url = "http://www.wsj.com/mdc/public/page/2_3023-fut_index-futures.html";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
// Make the cURL request
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
// Parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
// Grab all the MONTH on the page
$xpath = new DOMXPath($dom);
$data = $xpath->query("/html/body/div[6]/div[3]/div/table[9]/tbody/tr[position() >= 3 and position() <=6]");
//[position() >= 1 and position() <=13]
// Searching for data
$values = array();
foreach ($data as $row) {
    $values[] = $row->nodeValue;
}
print_r($values);
?>
</body>
</html>
Upvotes: 1
Views: 214
Reputation: 21
I solved my problem: it was the path. The path Firebug gave me wasn't the right one for this site. I don't know why.
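One possible explanation, offered as an assumption since the page source isn't shown here: Firebug displays the DOM after the browser has normalized it, and browsers add a `tbody` element to tables even when the raw HTML has none. PHP's `DOMDocument` parses the raw HTML and does not add it, so a copied path containing `/tbody/` can match nothing. A minimal sketch with made-up markup (the table below is hypothetical, not the WSJ page):

```php
<?php
// Hypothetical markup standing in for the scraped page: no <tbody> in the source.
$html = '<html><body><table>'
      . '<tr><td>Month</td></tr>'
      . '<tr><td>Dec 15</td></tr>'
      . '<tr><td>Mar 16</td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of silencing with @
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Path as a browser inspector would typically report it (tbody step included).
$withTbody = $xpath->query('/html/body/table/tbody/tr');

// Same path without the tbody step, matching the raw source.
$withoutTbody = $xpath->query('/html/body/table/tr');

echo "with tbody: ", $withTbody->length, "\n";
echo "without tbody: ", $withoutTbody->length, "\n";
```

If the first count is zero and the second is not, dropping `tbody` from the copied path (or using `//table//tr`) is worth trying on the real page.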
Upvotes: 0
Reputation: 131
A few things come to mind. Have you checked what the incoming HTML looks like? Does it contain anything that doesn't belong there? And is the XPath you're using correct? At least according to this older answer, the range in the XPath should be given in the form
[position() >= 100 and not(position() > 200)]
https://stackoverflow.com/a/3355022/5526468
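To see whether the form of the range matters, both spellings can be run against the same document. This is only a sketch: the eight-row table and the `$texts` helper are invented for illustration, and the bounds 3 and 6 are taken from the question's query.

```php
<?php
// Hypothetical 8-row table, built only for this comparison.
$rows = '';
for ($i = 1; $i <= 8; $i++) {
    $rows .= "<tr><td>row $i</td></tr>";
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML("<html><body><table>$rows</table></body></html>");
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Helper (hypothetical name): turn a node list into an array of text values.
$texts = function (DOMNodeList $nodes): array {
    $out = [];
    foreach ($nodes as $node) {
        $out[] = $node->nodeValue;
    }
    return $out;
};

// Range written as in the question.
$a = $texts($xpath->query('//table/tr[position() >= 3 and position() <= 6]'));

// Range written as in the linked answer.
$b = $texts($xpath->query('//table/tr[position() >= 3 and not(position() > 6)]'));

print_r($a);
print_r($b);
```

In this sketch both queries select the same rows, so the spelling of the upper bound is unlikely to be what broke the original query.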
Edit: Now that I think of it, if the actual HTML contains fewer items than the range asks for, maybe the XPath evaluates the range expression as false, so the query finds nothing?
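That idea can be checked directly. In XPath 1.0 a predicate is evaluated once per candidate node, so a table shorter than the upper bound should still return the rows that do fall inside the range. A minimal sketch, assuming a made-up four-row table and the 3 to 6 range from the question:

```php
<?php
// Hypothetical table with only 4 rows, fewer than the upper bound of 6.
$html = '<html><body><table>'
      . '<tr><td>row 1</td></tr>'
      . '<tr><td>row 2</td></tr>'
      . '<tr><td>row 3</td></tr>'
      . '<tr><td>row 4</td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// The predicate is tested against each <tr> separately.
$nodes = $xpath->query('//table/tr[position() >= 3 and position() <= 6]');

$found = [];
foreach ($nodes as $node) {
    $found[] = $node->nodeValue;
}

print_r($found);
```

Here rows 3 and 4 still come back, so a short table alone would shrink the result rather than empty it. An empty result points more toward the path itself not matching.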
Upvotes: 1