Reputation: 21150
I'm building a (relatively) simple web scraper using PHP/cURL. This is my first time using PHP. I tested this code in ScraperWiki and it worked just fine, but when I try to use it on my own server it doesn't run. I know the script is being read, because if I remove the simple_html_dom include I get error messages. But when it's included, I get a 500 server error.
I don't really know where to start troubleshooting here. Would someone mind looking over the code to see if there are any obvious errors? At present I just want the page to print variables on the screen so I know it's working properly; then I'm going to hook it up to MySQL. The script is in a folder on my server, along with simple_html_dom.php, and I'm accessing it by going to www.domain.com/folder/index.php, which contains the following code:
<?php
// Include simple html dom
include('simple_html_dom.php');
// Defining the basic cURL function
function curl($url) {
    $ch = curl_init(); // Initialising cURL
    curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}
$allLinks = array();
$counter = 0;
function nextPage($nextUrl){
    global $counter;
    getLinks($nextUrl);
}
function getLinks($url){ // gets links from product list page
    global $allLinks;
    global $counter;
    $html_content = curl($url);
    $html = str_get_html($html_content);
    foreach ($html->find("div.views-row a.imagecache-product_list") as $el) {
        // Store the absolute URL without a trailing newline, since it is later passed to cURL
        $allLinks[$counter] = 'http://www.uptherestore.com' . $el->href;
        $counter++;
    }
    $next = $html->find("li.pager-next a", 0);
    if( $next != false ) $next = $next->href;
    if (isset($next)) {
        $nextUrl = 'http://www.uptherestore.com';
        $nextUrl .= $next;
        nextPage($nextUrl);
    }else{return;}
}
class Product{ //Creates an object class for products
    public $name = '';
    public $infoLink = '';
    public $description = '';
    public $mainImage = '';
    public $moreImages1 = '';
    public $moreImages2 = '';
    public $moreImages3 = '';
    public $moreImages4 = '';
    public $price = '';
    public $designer = '';
}
function getInfo($infoLink){ // Trawls the product pages for info
    static $i = 0; // static keeps the count between calls
    $the_content = curl($infoLink);
    $the_html = str_get_html($the_content);
    $productName = $the_html->find("#item_info h1", 0)->innertext;
    $products[$productName] = new Product;
    $products[$productName]->name = $productName;
    $products[$productName]->infoLink = $infoLink;
    $products[$productName]->designer = $the_html->find("#item_info h2", 0)->innertext;
    $products[$productName]->description = $the_html->find("#item_info .product-body", 0)->innertext; //Might cause issues because there are multiple <p> tags in this div
    $products[$productName]->mainImage = $the_html->find("#item_image .imagecache-product_item_default", 0)->src;
    $more1 = $the_html->find(".extra_images", 0);
    $more2 = $the_html->find(".extra_images", 1);
    $more3 = $the_html->find(".extra_images", 2);
    $more4 = $the_html->find(".extra_images", 3);
    // Each extra image goes into its own property
    if (isset($more1)) {
        $products[$productName]->moreImages1 = $more1->src;
    }
    if (isset($more2)) {
        $products[$productName]->moreImages2 = $more2->src;
    }
    if (isset($more3)) {
        $products[$productName]->moreImages3 = $more3->src;
    }
    if (isset($more4)) {
        $products[$productName]->moreImages4 = $more4->src;
    }
    $products[$productName]->price = $the_html->find(".price", 0)->innertext;
    // Store: $infoLink, $description, $mainImage, $moreImages, $price, $designer
    echo $products[$productName]->name . "\n";
    echo $products[$productName]->description . "\n";
    echo $i;
    $i++;
}
getLinks("http://www.uptherestore.com/department/accessories");
foreach ($allLinks as $key => $value) {
    getInfo($value);
}
?>
Any help would be greatly appreciated.
Upvotes: 0
Views: 515
Reputation: 2068
It's quite difficult to determine what could be going wrong if the only feedback you're getting is an internal server error. I'd try adding some error_log calls or echo/print statements to find out at what point it stops running.
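As a rough sketch of that kind of instrumentation, something like the following could go at the top of index.php. It assumes your host allows ini_set() to override the display settings (some don't), and checkpoint() is a hypothetical helper name, not part of PHP or simple_html_dom:

```php
<?php
// Surface errors instead of a bare 500 while debugging.
// Assumption: the host permits these runtime overrides.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper: writes a labelled checkpoint to the PHP error log
// and returns the formatted message so it can also be echoed.
function checkpoint($label, $value = null) {
    $message = '[scraper] ' . $label;
    if ($value !== null) {
        $message .= ': ' . var_export($value, true);
    }
    error_log($message);
    return $message;
}

echo checkpoint('script started') . "\n";
echo checkpoint('curl extension loaded', extension_loaded('curl')) . "\n";
echo checkpoint('simple_html_dom.php readable', is_readable(__DIR__ . '/simple_html_dom.php')) . "\n";
```

Dropping further checkpoint() calls before and after each curl() and str_get_html() call should narrow down the line where execution stops.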
One thing I do notice, however, is that you're checking if (isset($more1)) { when you set the $more variables to the result of $the_html->find. From looking at the docs for the find method in simple html dom parser, it will return null if it cannot find an element, so a more explicit check would be if (!is_null($more1)) {. Bear in mind that isset() also returns false for a variable holding null, so the two checks behave the same here and this is mostly a readability change.
You could see if that solves the issue, but if not, I'd recommend putting in some logging or checking the server/PHP logs.
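As an illustration of that null check, here is a small sketch of a guard that avoids dereferencing a missing element. It uses plain objects as stand-ins for simple_html_dom nodes so that it runs on its own, and attrOrDefault() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: read a property from a node that may be null
// (per the docs mentioned above, find() with an index gives null on no match).
function attrOrDefault($node, $attr, $default = '') {
    if (is_null($node) || !isset($node->$attr)) {
        return $default;
    }
    return $node->$attr;
}

// Stand-ins for the results of $the_html->find(".extra_images", N).
$found = new stdClass();
$found->src = '/images/extra-1.jpg';
$missing = null;

echo attrOrDefault($found, 'src') . "\n";
echo attrOrDefault($missing, 'src', '(none)') . "\n";
```

The same guard would be worth applying to the chained calls such as find("#item_info h1", 0)->innertext, since calling a property on null is the kind of thing that can surface as a 500.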
Upvotes: 1