Indra

Reputation: 53

How to run a PHP process for a longer time

I am working on web scraping with PHP and cURL to scrape a whole website, but it takes more than one day to complete the scraping process.

I have even used

ignore_user_abort(true);                         // keep running if the client disconnects
set_error_handler(array(&$this, 'customError')); // route errors to a custom handler
set_time_limit(0);                               // remove the max execution time limit
ini_set('memory_limit', '-1');                   // remove the memory limit

I also clear memory after scraping each page. I am using Simple HTML DOM to extract the scraping details from a page, and I free it afterwards, roughly as shown below.
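
For reference, this is roughly the per-page cleanup I mean (a minimal sketch; it assumes the parsed page is held in a simple_html_dom object called $html):

// free the DOM tree that Simple HTML DOM keeps in memory,
// then drop the reference so PHP can garbage-collect it
$html->clear();
unset($html);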

The process runs and works fine for some number of links, but after that it stops. The browser keeps spinning, yet no error log is generated.

I cannot figure out what the problem is.
Also, can a PHP process run for two or three days?

Thanks in advance.

Upvotes: 0

Views: 85

Answers (2)

Jake Whiteley

Reputation: 505

PHP can run for as long as you need it to, but the fact that it stops at what seems to be the same point every time indicates there is an issue with your script.

You said you have tried ignore_user_abort(true);, but then indicated you are running this via a browser. That setting only matters on the command line, as closing a browser window will not terminate a script of this type anyway.

Do you have Xdebug? Simple HTML DOM will throw some rather interesting errors with malformed HTML (a link within a broken link, for example). Xdebug will throw a MAX_NESTING_LEVEL error in a browser, but will not throw it in a console unless you have explicitly told it to with the -d flag.

There are lots of other errors, notices, warnings, etc. that will break or stop your script without writing anything to error_log.

Are you getting any errors?

When using cURL in this way, it is important to use multi cURL to process URLs in parallel - depending on your environment, 150-200 URLs at a time is easy to achieve; see the sketch below.
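
A minimal sketch of that batching approach (the URL list and batch size are placeholders; the curl_multi_* calls are standard PHP):

$urls = array(/* a batch of 150-200 URLs */);
$mh = curl_multi_init();
$handles = array();
$pages = array();

// register one easy handle per URL on the multi handle
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// drive all transfers until every handle has finished
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

// collect the responses and release the handles
foreach ($handles as $url => $ch) {
    $pages[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);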

If you have truly sorted out the memory issue and freed everything as you have indicated, then the problem must be with a particular page it is crawling.

I would suggest running your script via a console, finding out exactly where it stops, and then running that URL separately - at the very least this will indicate whether it is a memory issue or not.

Also remember that set_error_handler(array(&$this, 'customError')); will NOT catch every type of error PHP can throw.
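
For example, fatal errors bypass custom handlers entirely; a shutdown function is the usual way to still get them logged (a sketch only - the log file path is a placeholder):

// fatal errors never reach set_error_handler(), but a shutdown
// function still runs and can inspect the last error
register_shutdown_function(function () {
    $err = error_get_last();
    if ($err !== null && $err['type'] === E_ERROR) {
        file_put_contents(
            '/tmp/scraper-fatal.log',
            date('c') . " {$err['message']} in {$err['file']}:{$err['line']}\n",
            FILE_APPEND
        );
    }
});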

When you next run it, debug via a console to show progress, and keep track of actual memory use - either via PHP (printed to the console) or via your system's process manager. This way you will be closer to finding out what the actual issue with your script is.
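
Printing memory use from PHP can be as simple as this (memory_get_usage and memory_get_peak_usage are built in; the surrounding per-URL loop is assumed):

// inside the crawl loop, after each URL is processed
printf(
    "%s | current: %.1f MB | peak: %.1f MB\n",
    $url,
    memory_get_usage(true) / 1048576,
    memory_get_peak_usage(true) / 1048576
);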

Upvotes: 1

Carlos M. Meyer

Reputation: 446

Even if you set unlimited memory, a physical limit still exists.

If you crawl the URLs recursively, memory can fill up.

Try a loop and work with a database instead:

Scan a page and store the found links if they aren't in the database yet. When finished, do a SELECT to get the first unscanned URL, and repeat the loop (see the sketch below).
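
A minimal sketch of that loop, using PDO with SQLite for self-containment (the table layout, the start URL, and the scrape_links() helper that fetches a page and returns its links are hypothetical):

$db = new PDO('sqlite:crawl.db');
$db->exec('CREATE TABLE IF NOT EXISTS urls (
    url TEXT PRIMARY KEY,
    scanned INTEGER DEFAULT 0
)');

// seed the queue with the start page
$insert = $db->prepare('INSERT OR IGNORE INTO urls (url) VALUES (?)');
$insert->execute(array('http://example.com/'));

while (true) {
    // get the first URL that has not been scanned yet
    $url = $db->query('SELECT url FROM urls WHERE scanned = 0 LIMIT 1')
              ->fetchColumn();
    if ($url === false) {
        break; // nothing left to crawl
    }

    // scrape_links() is a hypothetical helper: fetch the page
    // and return the links found on it
    foreach (scrape_links($url) as $link) {
        $insert->execute(array($link)); // ignored if already queued
    }

    $db->prepare('UPDATE urls SET scanned = 1 WHERE url = ?')
       ->execute(array($url));
}

Because the crawl frontier lives in the database rather than on the call stack, the script's memory use stays flat no matter how many pages it visits, and the crawl can resume after a restart.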

Upvotes: 0
