Milan Verescak

Reputation: 33

Domain-specific crawling with different settings for each domain (e.g. speed) using Storm crawler

I have only recently discovered StormCrawler, and from my past experience, studies, and work with different crawlers, I find this Apache Storm based project pretty robust and suitable for many use cases and scenarios.

I have read some tutorials and tested StormCrawler with a basic setup. I would like to use the crawler in my project, but there are certain things I am not sure the crawler is capable of doing, or whether it is even suitable for such use cases.

I would like to do small and large recursive crawls on many web domains with specific speed settings and a limit on the number of fetched URLs. The crawls can be started separately at any time with different settings (different speed, ignoring robots.txt for that domain, ignoring external links).

Questions:

  • Is the storm crawler suitable for such scenario?
  • Can I set the limit to the maximum number of pages fetched by the crawler?
  • Can I set the limits to the number of fetched pages for different domains?
  • Can I monitor the progress of the crawl for specific domains separately?
  • Can I set the settings dynamically without the need of uploading modified topology to storm?
  • Is it possible to pause or stop crawling (for specific domain)?
  • Is usually storm crawler running as one deployed topology?

I assume that for some of these questions the answer may be in customizing or writing my own bolts or spouts. But I would rather avoid modifying the Fetcher Bolt or the main logic of the crawler, as that would mean I am developing another crawler.

Thank you.

Upvotes: 1

Views: 873

Answers (2)

Julien Nioche

Reputation: 4854

Glad you like StormCrawler

  • Is the storm crawler suitable for such scenario?

Probably, but you'd need to modify/customise a few things.

  • Can I set the limit to the maximum number of pages fetched by the crawler?

You can currently set a limit on the depth from the seeds and have a different value per seed.

There is no mechanism for filtering globally based on the number of URLs, but this could be done. It depends on what you use to store the URL status and the corresponding spout and status updater implementations. For instance, if you were using Elasticsearch for storing the URLs, you could have a URL filter check the number of URLs in the index and filter URLs (existing or not) based on that.
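As a rough illustration (not something shipped with StormCrawler), a global cap could be implemented as a custom URL filter along these lines. The sketch assumes the 1.x-style URLFilter interface (a configure() method plus filter() returning null to drop a URL; check the exact signature for your version) and the Elasticsearch high-level REST client. The class name, the "status" index name and the "maxUrls" parameter are made up for the example.

    import java.io.IOException;
    import java.net.URL;
    import java.util.Map;

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.client.core.CountRequest;

    import com.digitalpebble.stormcrawler.Metadata;
    import com.digitalpebble.stormcrawler.filtering.URLFilter;
    import com.fasterxml.jackson.databind.JsonNode;

    // Hypothetical filter: stops keeping newly discovered URLs once the status
    // index holds more than a configured number of documents.
    public class MaxUrlCountFilter implements URLFilter {

        private RestHighLevelClient client;
        private long maxUrls = 100_000;        // default cap, overridden via "maxUrls"
        private String statusIndex = "status"; // assumed status index name

        @Override
        public void configure(Map stormConf, JsonNode paramNode) {
            JsonNode max = paramNode.get("maxUrls");
            if (max != null) {
                maxUrls = max.asLong();
            }
            // Hard-coded host for brevity; read it from the configuration in practice.
            client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("localhost", 9200, "http")));
        }

        @Override
        public String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter) {
            try {
                long known = client.count(new CountRequest(statusIndex),
                        RequestOptions.DEFAULT).getCount();
                // Keep the URL only while we are under the global cap.
                return known < maxUrls ? urlToFilter : null;
            } catch (IOException e) {
                // On failure, err on the side of keeping the URL.
                return urlToFilter;
            }
        }
    }

In practice you would cache the count and refresh it periodically rather than issuing a count request for every candidate URL, and register the filter in urlfilters.json like any other filter.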

  • Can I set the limits to the number of fetched pages for different domains?

You could specialize the solution proposed above and query per domain or host for the number of URLs already known. Doing this would not require any modifications to the core elements, just a custom URL filter.
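Building on the sketch above, only the counting part changes for a per-domain or per-host limit: restrict the count to the host of the candidate URL with a term query. This fragment would sit inside the same try/catch as before (so the MalformedURLException from new URL() is covered); the field name "key" and the maxUrlsPerHost setting are assumptions, so use whatever your status index mapping actually stores the host or domain in.

    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    // Inside filter(), replacing the global count from the previous sketch:
    String host = new URL(urlToFilter).getHost();

    // Count only the URLs already known for this host. "key" is an assumed
    // field name holding the host/domain in the status index.
    CountRequest perHost = new CountRequest(statusIndex)
            .source(new SearchSourceBuilder()
                    .query(QueryBuilders.termQuery("key", host)));

    long knownForHost = client.count(perHost, RequestOptions.DEFAULT).getCount();

    // maxUrlsPerHost would be read in configure(), like maxUrls above.
    return knownForHost < maxUrlsPerHost ? urlToFilter : null;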

  • Can I monitor the progress of the crawl for specific domains separately?

Again, it depends on what you use as a back end. With Elasticsearch for instance, you can use Kibana to see the URLs per domain.

  • Can I set the settings dynamically without the need of uploading modified topology to storm?

No. The configuration is read when the worker tasks are started. I know of some users who wrote a custom configuration implementation backed by a DB table and got their components to read from that, but this meant modifying a lot of code.
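Purely to illustrate the pattern mentioned above (this is not something StormCrawler provides), a DB-backed configuration lookup might start out as a small helper like the one below. The JDBC URL, table and column names are invented, and every component that should honour such overrides would still have to be modified to call it, which is exactly why this approach means touching a lot of code.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical helper reading key/value overrides from a "crawler_config"
    // table (columns "conf_key" and "conf_value", both made up) so that
    // components can merge them over the static topology configuration.
    public final class DbConfigLoader {

        private DbConfigLoader() {
        }

        public static Map<String, Object> loadOverrides(String jdbcUrl) {
            Map<String, Object> overrides = new HashMap<>();
            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT conf_key, conf_value FROM crawler_config")) {
                while (rs.next()) {
                    overrides.put(rs.getString("conf_key"), rs.getString("conf_value"));
                }
            } catch (Exception e) {
                // If the table is unreachable, fall back to the static configuration.
            }
            return overrides;
        }
    }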

  • Is it possible to pause or stop crawling (for specific domain)?

Not on a per-domain basis, but you could add an intermediate bolt to check whether a domain should be processed or not. If not, you could simply fail the ack. This depends on the status storage again. You could also add a custom filter to the ES spouts, for instance, plus a field in the status index. Whenever the crawl should be halted for a specific domain, you could e.g. modify the value of the field for all the URLs matching that domain.
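As a rough sketch of the intermediate bolt idea ("fail the ack" meaning failing the tuple so the status storage reschedules it), a gate bolt placed between the spout and the fetcher could look like the following. It assumes the usual "url" and "metadata" tuple fields; how the set of paused hosts is populated (DB, ES field, config service) is left out and hard-coded for brevity.

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    // Hypothetical gate bolt: tuples whose host is currently paused are failed
    // so they go back to the status storage; everything else is passed on.
    public class DomainGateBolt extends BaseRichBolt {

        private OutputCollector collector;
        private Set<String> pausedHosts;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.pausedHosts = ConcurrentHashMap.newKeySet();
            // In a real topology this would be refreshed from an external source.
            pausedHosts.add("example.com");
        }

        @Override
        public void execute(Tuple input) {
            String url = input.getStringByField("url");
            String host = "";
            try {
                host = new URL(url).getHost();
            } catch (MalformedURLException e) {
                // let malformed URLs through; other filters will deal with them
            }
            if (pausedHosts.contains(host)) {
                // Failing the tuple effectively pauses the domain until the
                // host is removed from the set and the URL gets rescheduled.
                collector.fail(input);
                return;
            }
            collector.emit(input, new Values(url, input.getValueByField("metadata")));
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("url", "metadata"));
        }
    }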

  • Is usually storm crawler running as one deployed topology?

Yes, often.

  • I assume that for some of these questions the answer may be in customizing or writing my own bolts or spouts. But I would rather avoid modifying the Fetcher Bolt or the main logic of the crawler, as that would mean I am developing another crawler.

StormCrawler is very modular, so there are always several ways of doing things ;-)

I am pretty sure you could get the behavior you want with a single topology by modifying small, non-core parts. If more essential parts of the code are affected (e.g. per-seed robots settings), then we'd probably want to add that to the code; your contributions would be very welcome.

Upvotes: 2
