Reputation: 9837
The following function receives a string parameter representing a URL and loads it into a simple_html_dom object. If the loading fails, it attempts to load the URL again.
public function getSimpleHtmlDomLoaded($url)
{
    $ret = false;
    $count = 1;
    $max_attempts = 10;
    while ($ret === false) {
        $html = new simple_html_dom();
        $ret = $html->load_file($url);
        if ($ret === false) {
            echo "Error loading url: $url\n";
            sleep(5);
            $count++;
            $html->clear();
            unset($html);
            // Give up once the maximum number of attempts is reached.
            if ($count > $max_attempts) {
                return false;
            }
        }
    }
    return $html;
}
However, if the URL fails to load once, it keeps failing for the current URL, and after the maximum attempts are exhausted it also keeps failing on subsequent calls to the function for the rest of the URLs it has to process.
It would make sense for it to keep failing if the URLs were temporarily offline, but they are not (I've checked while the script was running).
Any ideas why this is not working properly?
I would also like to point out that when it starts failing to load the URLs, it only gives a single warning (instead of multiple ones), with the following message:
PHP Warning: file_get_contents(http://www.foo.com/resource): failed to open stream: HTTP request failed! in simple_html_dom.php on line 1081
This is triggered by the following line of code:
$ret = $html->load_file($url);
Upvotes: 0
Views: 273
Reputation: 1
Maybe it is a problem with the load_file() function itself.
The problem was that error_get_last() also returns all previous errors; I don't know, maybe it depends on the PHP version?
I solved the problem by changing load_file() to check whether the error changed, not whether it is null (alternatively, use the non-object function file_get_html()):
function load_file()
{
    // Remember the last error *before* loading, so we only react to errors produced by this call.
    $preerror = error_get_last();
    $args = func_get_args();
    $this->load(call_user_func_array('file_get_contents', $args), true);
    // Throw an error if we can't properly load the dom.
    if (($error = error_get_last()) !== $preerror) {
        $this->clear();
        return false;
    }
}
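For completeness, here is a minimal sketch of what the calling code from the question could look like when it uses file_get_html() instead, assuming that helper is available in your copy of the library; it creates a fresh object per attempt and signals failure by returning false, without relying on error_get_last(). The retry count and sleep interval are simply carried over from the question.

function getSimpleHtmlDomLoaded($url)
{
    $max_attempts = 10;

    for ($attempt = 1; $attempt <= $max_attempts; $attempt++) {
        // file_get_html() returns false if file_get_contents() fails
        // (or the content is empty / too large).
        $html = file_get_html($url);
        if ($html !== false) {
            return $html;
        }
        echo "Error loading url: $url\n";
        sleep(5);
    }

    return false;
}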
Upvotes: 0
Reputation: 3998
I have tested your code and it works perfectly for me; every time I call that function it returns a valid result on the first attempt.
So even if you load the pages from the same domain, there can be some protection on the page or server. For example, the page can look for certain cookies, or the server can check your user agent, and if it sees you as a bot it will not serve the correct content.
I had similar problems while parsing some websites. The answer for me was to see what the page/server expects and make my code simulate that: everything from faking the user agent to generating cookies and such. A rough sketch of that idea follows.
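This is only an illustrative sketch: the user-agent string and cookie are made-up placeholders you would replace with whatever the target site actually expects, and it assumes simple_html_dom's str_get_html() helper is available.

// Build an HTTP context that mimics a regular browser request.
$context = stream_context_create(array(
    'http' => array(
        'header' => implode("\r\n", array(
            'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0', // placeholder UA
            'Cookie: session=dummyvalue',                                                       // placeholder cookie
        )),
    ),
));

// Fetch the page ourselves, then hand the HTML string to the parser.
$contents = file_get_contents('http://www.foo.com/resource', false, $context);
if ($contents !== false) {
    $html = str_get_html($contents);
    // ... work with $html here ...
    $html->clear();
} else {
    echo "Request still failed, so the server probably checks more than this\n";
}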
By the way, have you tried creating a simple PHP script just to test that the 'simple html dom' parser can run on your server with no errors? That is the first thing I would check.
Finally, I must add that in one case I failed in numerous tries to parse a page and could not win the masking game. In the end I made a script that loads the page in the Linux command-line text browser lynx, saves the whole page locally, and then parses that local file, which worked perfectly. A rough sketch of that workaround is below.
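Again just a sketch, assuming lynx is installed and that file_get_html() can read a local path (it goes through file_get_contents() internally); the URL and file name are arbitrary placeholders.

$url  = 'http://www.foo.com/resource';   // the stubborn page (placeholder)
$file = '/tmp/page_dump.html';           // arbitrary local path

// Let lynx fetch the raw HTML source and save it locally.
shell_exec('lynx -source ' . escapeshellarg($url) . ' > ' . escapeshellarg($file));

// Parse the local copy instead of hitting the server from PHP.
$html = file_get_html($file);
if ($html !== false) {
    // ... extract whatever you need here ...
    $html->clear();
}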
Upvotes: 1