Rafay

Reputation: 6188

Resolving errors in PHP script with large execution time

I have implemented a web crawler that crawls and retrieves content from the .edu TLD. The HTML source of each page is inserted into MySQL tables. When a large number of seed URLs are fed to the crawler, the script can run for hours on a decent internet connection. My problem is that the script halts after crawling a number of links without reporting any errors. I have used exception handling to deal with the "MySQL server has gone away" error, eliminated a lot of other problems, and added if conditions that echo errors when they are encountered. However, I am not getting any errors. The script halts whether I run it in the browser, Eclipse PDT, or the CLI, although the number of links crawled before it halts differs somewhat across the three. I have altered max_execution_time and other directives in php.ini, but this has not helped.

I have coded the script so that it resumes the crawling from where it halted, but I want the script to continue without halting so that I don't have to monitor whether the script is running or not.

Should I make changes to my Apache httpd.conf file? If so, what should those settings be?

The description in these links for my web crawler may help.

This is the code that retrieves the HTML from a URL. It is from simple_html_dom.

function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $defaultBRText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //    $contents = retrieve_url_contents($url);
    if (empty($contents))
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}

Here is the error log for the following links:

And the crawler stopped after crawling this link:

[01-Jan-2012 22:54:39] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:54:39] PHP Warning: file_get_contents(http://lms.nust.edu.pk) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:54:41] PHP Warning: file_get_contents(http://www.nust.edu.pk/#) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

... (same error repeated twice) ...

[01-Jan-2012 22:55:58] PHP Warning: file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#ipo) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:55:58] PHP Warning: file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#tto) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:55:59] PHP Warning: file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#ilo) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:55:59] PHP Warning: file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#mco) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:56:05] PHP Warning: file_get_contents(http://www.nust.edu.pk/#) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

... (same error repeated 18 times) ...

[01-Jan-2012 22:57:33] PHP Warning: file_get_contents(http://www.nust.edu.pk/#ctl00_SiteMapPath1_SkipLink) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:57:33] PHP Notice: Undefined variable: parts in D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 22:57:55] PHP Warning: file_get_contents(http://www.harvard.edu/#skip) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:58:21] PHP Warning: file_get_contents(http://www.harvard.edu/admissions-aid#undergrad) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:58:22] PHP Warning: file_get_contents(http://www.harvard.edu/admissions-aid#grad) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:58:24] PHP Warning: file_get_contents(http://www.harvard.edu/admissions-aid#continue) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:58:25] PHP Warning: file_get_contents(http://www.harvard.edu/admissions-aid#summer) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:00:04] PHP Warning: file_get_contents(http://www.harvard.edu/#) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

... (same error repeated 1 time) ...

[01-Jan-2012 23:00:11] PHP Notice: Undefined variable: parts in D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 23:00:41] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:00:41] PHP Warning: file_get_contents(http://directory.berkeley.edu) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:00:47] PHP Notice: Undefined variable: parts in D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 23:01:53] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:01:53] PHP Warning: file_get_contents(http://students.berkeley.edu/uga/) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:01:57] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:01:57] PHP Warning: file_get_contents(http://publicservice.berkeley.edu/) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:00] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:00] PHP Warning: file_get_contents(http://students.berkeley.edu/osl/leadprogs.asp) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:17] PHP Notice: Undefined variable: parts in D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 23:02:25] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:25] PHP Warning: file_get_contents(http://bearfacts.berkeley.edu/bearfacts) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:28] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:28] PHP Warning: file_get_contents(http://career.berkeley.edu/) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72
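
Every URL in the log that fails with 400 Bad Request or 404 Not Found still carries a #fragment (for example #ipo or #skip), which suggests the fragment is being sent to the server as part of the request. Fragments are only meaningful on the client side, so one thing worth trying is to strip them before fetching. The following is a minimal sketch; strip_fragment() is a hypothetical helper name, not part of simple_html_dom, and it does not address the php-cgi.exe crash itself.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): removes the "#fragment"
// part of a URL so that only the part a server understands is requested.
function strip_fragment($url)
{
    $pos = strpos($url, '#');
    // No '#' present: return the URL unchanged.
    if ($pos === false) {
        return $url;
    }
    // Keep everything before the first '#'.
    return substr($url, 0, $pos);
}

// Example with a URL taken from the log above.
$clean = strip_fragment('http://www.nust.edu.pk/usr/oricdic.aspx#ipo');
```

Applying this before queuing a link would also collapse links such as /admissions-aid#grad and /admissions-aid#summer into a single URL, so the same page is not fetched several times.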

And this is the error log from php-cgi.exe:

Problem signature:
  Problem Event Name:   APPCRASH
  Application Name: php-cgi.exe
  Application Version:  5.3.8.0
  Application Timestamp:    4e537939
  Fault Module Name:    php5ts.dll
  Fault Module Version: 5.3.8.0
  Fault Module Timestamp:   4e537a04
  Exception Code:   c0000005
  Exception Offset: 0000c793
  OS Version:   6.1.7601.2.1.0.256.48
  Locale ID:    1033
  Additional Information 1: 0a9e
  Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
  Additional Information 3: 0a9e
  Additional Information 4: 0a9e372d3b4ad19135b953a78882e789

Please help me in this regard.

Upvotes: 2

Views: 1068

Answers (1)

rkosegi

Reputation: 14628

You should check the call stack of the PHP process (if it is running as CGI or CLI) or of the Apache httpd process (if it is running as mod_php).

Then you will see in which module or procedure execution halted. You can also check the active TCP/IP connections made by your script; there may be an ongoing I/O operation that is causing your script to halt.
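
If a stalled network read does turn out to be the cause, one way to bound it is to pass a stream context with an explicit timeout to file_get_contents(). The sketch below is only an illustration under that assumption: make_fetch_context() is a hypothetical helper name and the 15-second value is arbitrary. The 'timeout' key is a standard option of PHP's HTTP stream context.

```php
<?php
// Hypothetical helper: builds a stream context whose HTTP requests give up
// after $timeoutSeconds instead of waiting indefinitely on a stalled socket.
function make_fetch_context($timeoutSeconds)
{
    return stream_context_create(array(
        'http' => array(
            'timeout' => (float) $timeoutSeconds, // read timeout, in seconds
        ),
    ));
}

// 15 seconds is an arbitrary example value.
$context = make_fetch_context(15);

// file_get_html() from the question takes the context as its third argument:
// $dom = file_get_html($url, false, $context);
```

With a timeout in place, a hung connection produces a warning and a false return value from file_get_html() rather than a silent stall, so the crawler can log the URL and move on.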

I hope this helps.

Upvotes: 2
