Crawler with MozMill

Question

I have a beginner question: I'd like to write a crawler (~1000 web pages) with MozMill, but too often websites fail to load some elements, so the page load never completes. The waitForPageLoad() method then stalls my crawler. How should I proceed?

Kiril · Accepted Answer

The waitForPageLoad method is blocking: the thread that calls it waits until the method has completed. There are two ways to stop your application from blocking:

Specify a timeout.
Run multiple threads.

The documentation for waitForPageLoad shows that it accepts a timeout, so set it to something reasonable. The call then stops waiting as soon as the page has loaded or the timeout has expired:

void waitForPageLoad(
  in DOMDocument document,
  in int timeout,
  in int interval
);
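As a rough sketch of the timeout approach, the crawler could wrap each page load in a small helper like the one below. It assumes that `controller` is a MozMill browser controller, that `controller.tabs.activeTab` is the document of the current tab, that the timeout and interval are given in milliseconds, and that an expired timeout surfaces as a thrown error. `tryLoadPage` and the two constants are illustrative names, not part of MozMill.

```javascript
// Sketch only: `controller` is assumed to be a MozMill browser controller.
// PAGE_TIMEOUT_MS, POLL_INTERVAL_MS and tryLoadPage are illustrative names.
var PAGE_TIMEOUT_MS = 15000;   // give up on a page after 15 seconds
var POLL_INTERVAL_MS = 100;    // how often the load state is re-checked

function tryLoadPage(controller, url) {
  controller.open(url);
  try {
    // Argument order follows the signature quoted above:
    // (document, timeout, interval).
    controller.waitForPageLoad(controller.tabs.activeTab,
                               PAGE_TIMEOUT_MS, POLL_INTERVAL_MS);
    return true;               // the page finished loading in time
  } catch (e) {
    // Assumption: an expired timeout is reported as a thrown error.
    return false;              // the caller can skip or retry this URL
  }
}
```

The crawl loop can then call `tryLoadPage(controller, url)` for each URL and simply move on to the next one whenever it returns `false`.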

The second option is to run multiple threads, which might benefit you anyway, since one slow page then no longer holds up the rest of the crawl. Each thread loads a page, processes it, and then takes the next page to load from a shared queue of pages.
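The queue-based design can be illustrated without MozMill at all. The sketch below is plain JavaScript and swaps threads for concurrent asynchronous tasks: up to `maxActive` pages are in flight at once, each with its own timeout, and every finished page feeds newly found links back into the shared queue. `crawl`, `withTimeout` and `fetchPage` are hypothetical names; `fetchPage` stands for whatever function loads a URL and resolves to the links found on it.

```javascript
// Sketch only, not MozMill code. `fetchPage(url)` is a hypothetical,
// caller-supplied function returning a Promise of an array of links.

// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  var timer;
  var timeout = new Promise(function (_, reject) {
    timer = setTimeout(function () { reject(new Error("timeout")); }, ms);
  });
  return Promise.race([promise, timeout]).finally(function () {
    clearTimeout(timer);
  });
}

function crawl(seedUrls, fetchPage, maxActive, timeoutMs) {
  return new Promise(function (resolve) {
    var queue = seedUrls.slice();   // pages waiting to be loaded
    var seen = new Set(seedUrls);   // never queue the same URL twice
    var visited = [];               // pages that loaded in time
    var failed = [];                // pages that errored or timed out
    var active = 0;                 // page loads currently in flight

    function start(url) {
      active++;
      var load = Promise.resolve().then(function () { return fetchPage(url); });
      withTimeout(load, timeoutMs).then(function (links) {
        visited.push(url);
        links.forEach(function (link) {
          if (!seen.has(link)) { seen.add(link); queue.push(link); }
        });
      }, function () {
        failed.push(url);           // a stuck page costs one slot, one timeout
      }).then(function () {
        active--;
        pump();
      });
    }

    function pump() {
      // Fill free slots from the queue, then finish once all work is done.
      while (active < maxActive && queue.length > 0) {
        start(queue.shift());
      }
      if (active === 0 && queue.length === 0) {
        resolve({ visited: visited, failed: failed });
      }
    }

    pump();
  });
}
```

A call such as `crawl(["http://example.com/"], fetchPage, 4, 15000)` would resolve to an object listing the visited and failed URLs. A page that never finishes loading ends up in `failed` after its timeout instead of stalling the other slots.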
