Reputation: 339
If I set the for loop condition to $number <= 25, <= 50, or <= 75, this script works exactly how I want it to. If, however, I use 100 or higher, it throws an error:
Fatal error: Allowed memory size of 67108864 bytes exhausted (tried to allocate 24 bytes) in /public_html/php-scraper/simplehtmldom_1_5/simple_html_dom.php on line 1074
Is there a better way to code this? What I'm doing doesn't seem at all unusual for getting data off a web site. Do I need to allocate or initialize more memory? (I have no experience with this.) It just doesn't seem like it's a complicated PHP task. Also, I have no idea what this web page is; I'm just doing this for a company and they picked this catalog.
Thanks for reading. Here's the code
<?php
include_once 'simple_html_dom.php';
echo 'Category, AvnetPartNumber, Manufacturer, Price, Availability,';
//number is the value of which item the page starts with
for($number = 0; $number <= 100; $number = $number + 25){
// Create DOM from URL or file
$url = "http://avnetexpress.avnet.com/store/em/EMController/_/N-?Ns=PartNumber|0&action=excess_inventory&catalogId=&cutTape=&inStock=&langId=-1&myCatalog=&npi=&proto=&regionalStock=&rohs=&storeId=500201&term=&topSellers=&No=".$number;
$html = file_get_html($url);
$i = 0;
// Find all images
foreach($html->find('td[class=small dataTd]') as $element) {
if($i == 1 || $i == 3 || $i == 4 || $i == 7 || $i == 8){
echo $element->plaintext . ',' ;
}
if($i == 8){
$i =0;
}
else{
$i++;
}
}
}
?>
Upvotes: 0
Views: 1000
Reputation: 46620
I would do it slightly differently. First, I would use cURL, for two reasons: it's faster, and you can look like a normal browser by setting the user agent. Second, I would not bother with simple_html_dom; whatever you can do with it you can also do with PHP's built-in DOMDocument.
Also, you don't want to reset $i at 8: there are 10 columns in each row, so resetting there would skew your results. Resetting at 9 starts each new row as expected. In my example I put all the data in an array, but you should put it in a database, etc. As you can see, peak memory usage for this run is 1.40 MB. Hope it helps.
<?php
$url = 'http://avnetexpress.avnet.com/store/em/EMController/_/N-?Nn=50&Ns=PartNumber|0&action=excess_inventory&catalogId=&cutTape=&inStock=&langId=-1&myCatalog=&npi=&proto=&regionalStock=&rohs=&storeId=500201&term=&topSellers=&No=';
//Fetch offsets 0, 25, 50, 75 and 100 (range() includes both ends, so 5 requests)
$result = run_scrap($url,100,25);
//Memory usage
$memory = array();
$memory['used'] = getReadableFileSize(memory_get_peak_usage());
$memory['total'] = ini_get("memory_limit").'B';
print_r($result);
print_r($memory); //Array ( [used] => 1.40 MB [total] => 128MB )
/** Result
* Array
(
[0] => Array
(
[title] => Logic and Timing - Crystals
[partnum] => ##BP11DCRK430
[manufactuere] => TOKO America
[price] => $0.3149
[availability] => 4500 Stock
)
[1] => Array
(
[title] => Inductor - Inductor Leaded
[partnum] => #187LY-471J
[manufactuere] => TOKO America
[price] => $0.3149
[availability] => 100 Stock
)
...
*/
function run_scrap($url,$total_items=100,$step=25){
$range = range(0,$total_items,$step);
$result = array();
foreach($range as $page){
$src = curl_get($url.$page);
$result = array_merge($result,process($src));
}
return $result;
}
function process($src){
$dom = new DOMDocument("1.0","UTF-8");
//Set before loading so the option can take effect
$dom->preserveWhiteSpace = false;
//Suppress warnings caused by malformed HTML
@$dom->loadHTML($src);
$return = array();
$i=0;
$r=0;
foreach($dom->getElementsByTagName('td') as $ret) {
if($ret->getAttribute('class') == 'small dataTd'){
switch($i){
case 1:
$return[$r]['title'] = trim($ret->nodeValue);
break;
case 3:
$return[$r]['partnum'] = trim($ret->nodeValue);
break;
case 4:
$return[$r]['manufactuere'] = trim($ret->nodeValue);
break;
case 7:
$return[$r]['price'] = trim($ret->nodeValue);
break;
case 8:
$return[$r]['availability'] = trim($ret->nodeValue);
break;
default:
break;
}
//Reset after col 9
if($i == 9){
$i = 0;
$r++;
}else{
$i++;
}
}
}
return $return;
}
function curl_get($url){
//Abort early if the cURL extension is missing
if (!function_exists('curl_init')) { die('cURL must be installed!'); }
$curl = curl_init();
//Browser-like request headers
$header = array();
$header[0] = "Accept: text/xml,application/xml,application/json,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8');
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($curl, CURLOPT_HEADER, 0);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_TIMEOUT, 30);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$result = curl_exec($curl);
curl_close($curl);
return $result;
}
//Debug function - not related to the scraper
function getReadableFileSize($size, $retstring = null) {
$sizes = array('bytes', 'kB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB');
if ($retstring === null) { $retstring = '%01.2f %s'; }
$lastsizestring = end($sizes);
foreach ($sizes as $sizestring) {
if ($size < 1024) { break; }
if ($sizestring != $lastsizestring) { $size /= 1024; }
}
if ($sizestring == $sizes[0]) { $retstring = '%01d %s'; }
return sprintf($retstring, $size, $sizestring);
}
?>
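As a hedged alternative to the manual $i / $r counters above, the matching cells can be collected first and then grouped into rows of 10 with array_chunk(). This is only a sketch: extract_rows() and the sample markup are hypothetical, DOMXPath is swapped in for the getElementsByTagName loop, and the column count and field positions are taken from the answer rather than verified against the live page.

```php
<?php
// Sketch only: extract_rows() and $sampleHtml are hypothetical names.
// Assumes 10 "small dataTd" cells per table row, as stated in the answer.
function extract_rows($src, $columns = 10) {
    // Collect libxml warnings internally instead of using the @ operator
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($src);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select only the data cells
    $xpath = new DOMXPath($dom);
    $values = array();
    foreach ($xpath->query('//td[@class="small dataTd"]') as $td) {
        $values[] = trim($td->nodeValue);
    }

    // Column index => field name (positions from the answer's switch statement)
    $fields = array(1 => 'title', 3 => 'partnum', 4 => 'manufacturer',
                    7 => 'price', 8 => 'availability');
    $rows = array();
    foreach (array_chunk($values, $columns) as $chunk) {
        if (count($chunk) < $columns) {
            continue; // skip an incomplete trailing row
        }
        $row = array();
        foreach ($fields as $index => $name) {
            $row[$name] = $chunk[$index];
        }
        $rows[] = $row;
    }
    return $rows;
}

// Hypothetical sample: two table rows with cells A0..A9 and B0..B9
$sampleHtml = '<table>';
foreach (array('A', 'B') as $label) {
    $sampleHtml .= '<tr>';
    for ($c = 0; $c < 10; $c++) {
        $sampleHtml .= '<td class="small dataTd"> ' . $label . $c . ' </td>';
    }
    $sampleHtml .= '</tr>';
}
$sampleHtml .= '</table>';

$rows = extract_rows($sampleHtml);
print_r($rows);
```

Grouping by chunk keeps the row boundary in one place (the $columns argument), so an off-by-one reset like the one discussed above cannot shift the fields.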
Upvotes: 2
Reputation: 1960
<?php
require_once("SimpleHtmlDom/simple_html_dom.php");
$_htmlDom = new simple_html_dom();
echo 'Category, AvnetPartNumber, Manufacturer, Price, Availability,';
//number is the value of which item the page starts with
for($number = 0; $number <= 200; $number += 25){
// Create DOM from URL or file
$url = "http://avnetexpress.avnet.com/store/em/EMController/_/N-?Ns=PartNumber|0&action=excess_inventory&catalogId=&cutTape=&inStock=&langId=-1&myCatalog=&npi=&proto=&regionalStock=&rohs=&storeId=500201&term=&topSellers=&No=".$number;
$html = file_get_contents($url);
$_htmlDom->load($html);
$i = 0;
$elementList = $_htmlDom->find('td[class=small dataTd]');
// Walk the matching data cells
foreach($elementList as $element) {
if($i == 1 || $i == 3 || $i == 4 || $i == 7 || $i == 8){
echo $element->plaintext . ',' ;
}
if($i == 8){
$i = 0;
}else{
$i++;
}
flush();
}
}
?>
I tested this version on a NAS with 128 MB of RAM (actually less than 80 MB of RAM), and it works.
I only modified a few things compared with the original script.
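The loop above echoes each field followed by a comma, which breaks if a value itself contains a comma. A minimal sketch of an alternative, assuming the rows are available as arrays, is to write each row with fputcsv() as soon as it is parsed, so nothing accumulates in memory. The helper name write_csv_rows() is hypothetical; the sample row is copied from the example output shown earlier in this thread.

```php
<?php
// Sketch only: write_csv_rows() is a hypothetical helper.
// Writes each row to an open stream and returns how many rows were written.
function write_csv_rows($handle, array $rows) {
    $written = 0;
    foreach ($rows as $row) {
        // fputcsv() quotes fields that contain commas, quotes or spaces
        if (fputcsv($handle, array_values($row), ',', '"', '\\') !== false) {
            $written++;
        }
    }
    return $written;
}

// php://memory is used here so the result can be inspected;
// a real script would more likely open php://output or a file.
$out = fopen('php://memory', 'w+');
fputcsv($out, array('Category', 'AvnetPartNumber', 'Manufacturer', 'Price', 'Availability'), ',', '"', '\\');
$count = write_csv_rows($out, array(
    array(
        'title' => 'Inductor - Inductor Leaded',
        'partnum' => '#187LY-471J',
        'manufacturer' => 'TOKO America',
        'price' => '$0.3149',
        'availability' => '100 Stock',
    ),
));
rewind($out);
$csv = stream_get_contents($out);
fclose($out);
echo $csv;
```

Calling write_csv_rows() once per fetched page, rather than once at the end, keeps only a single page of rows in memory at a time.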
Upvotes: 1
Reputation: 69967
The problem doesn't seem to be related to your code. When you increase $number you are increasing (what appears to be) the number of results that get returned by the search.
The more results you return, the larger the resulting web page is, and therefore you end up with a lot more DOM nodes and links. The problem is that you are running out of memory (in PHP) when you try to call $html->find(). I'm not sure how that parser works but it probably parses all of the nodes into memory when you load the script.
The solution is to raise PHP's memory limit or to pull fewer than 100 results per request, since that seems to be the point at which you run out of memory.
You can increase your memory limit by calling ini_set('memory_limit', '128M'); at the beginning of your script. Note: I just picked 128M out of nowhere. Set that to whatever you think it needs to be.
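For reference, the limit in the error message, 67108864 bytes, is 64 MB. A small sketch of raising it for one script and checking how much memory was actually used might look like this; the 128M value is the same arbitrary example as above, and the placement of the scraping loop is only indicated by a comment.

```php
<?php
// Sketch only: '128M' is an arbitrary example value, not a recommendation.
// ini_set() returns the previous value, or false if the setting could not be changed.
$previous = ini_set('memory_limit', '128M');
if ($previous === false) {
    echo "Could not change memory_limit\n";
}
$current = ini_get('memory_limit');

// ... the scraping loop would run here ...

// Report the configured limit and the peak usage so far, in MB
$peakMb = memory_get_peak_usage() / 1048576;
printf("memory_limit: %s, peak usage: %.2f MB\n", $current, $peakMb);
```

Comparing the reported peak against the limit after a run with fewer pages gives a rough idea of how much headroom each extra page needs.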
Upvotes: 0