Reputation: 33
I have only recently discovered StormCrawler and, based on my past experience, studies, and work with different crawlers, I find this project based on Apache Storm quite robust and suitable for many use cases and scenarios.
I have read some tutorials and tested StormCrawler with a basic setup. I would like to use the crawler in my project, but there are certain things I am not sure it is capable of, or whether it is even suitable for such use cases.
I would like to run small and large recursive crawls on many web domains with specific speed settings and a limit on the number of fetched URLs. The crawls could be started separately at any time with different settings (different speed, ignoring robots.txt for a given domain, ignoring external links).
Questions:
I assume that for some of these questions the answer may lie in customizing or writing my own bolts or spouts. However, I would rather avoid modifying the FetcherBolt or the main logic of the crawler, as that would mean I am developing another crawler.
Thank you.
Upvotes: 1
Views: 873
Reputation: 4854
Glad you like StormCrawler.
Probably, but you'd need to modify/customise a few things.
You can currently set a limit on the depth from the seeds and have a different value per seed.
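To give a rough idea (class, parameter and metadata key names below are taken from StormCrawler 1.x and may differ in your version), the depth limit is a URL filter configured in urlfilters.json:

```json
{
  "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
  "name": "MaxDepthFilter",
  "params": {
    "maxDepth": 3
  }
}
```

and a seed can carry its own value as metadata, e.g. a seed file entry like http://example.com/ max.depth=1 (the metadata key name is an assumption here, check the filter's documentation for your version).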
There is no mechanism for filtering globally based on the number of URLs, but this could be done. It depends on what you use to store the URL status and on the corresponding spout and status updater implementations. For instance, if you were using Elasticsearch for storing the URLs, you could have a URL filter check the number of URLs in the index and filter URLs (existing or not) based on that.
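A very rough sketch of such a filter, assuming the URLFilter interface of StormCrawler 1.x; countKnownUrls() is just a placeholder for whatever count query your back end supports:

```java
import java.net.URL;
import java.util.Map;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.filtering.URLFilter;
import com.fasterxml.jackson.databind.JsonNode;

/**
 * Sketch of a URL filter which discards new URLs once the status storage
 * already contains more than a configured number of URLs.
 * Interface and package names match StormCrawler 1.x; check your version.
 */
public class MaxUrlCountFilter implements URLFilter {

    private long maxUrls = 100_000;

    // called with the "params" object of this filter's entry in urlfilters.json
    public void configure(Map stormConf, JsonNode paramNode) {
        JsonNode max = paramNode.get("maxUrls");
        if (max != null) {
            maxUrls = max.asLong();
        }
    }

    @Override
    public String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter) {
        // returning null removes the URL, returning it unchanged keeps it
        if (countKnownUrls() >= maxUrls) {
            return null;
        }
        return urlToFilter;
    }

    /**
     * Placeholder: in a real implementation this would query the status
     * storage, e.g. a count request on the Elasticsearch status index,
     * ideally cached so that it is not executed for every single URL.
     */
    protected long countKnownUrls() {
        return 0;
    }
}
```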
You could specialize the solution proposed above and query per domain or host for the number of URLs already known. Doing this would not require any modifications to the core elements, just a custom URL filter.
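Building on the sketch above, the per-host count could be done with the Elasticsearch high-level REST client roughly like this (7.x client assumed; the index name "status" and field name "host" are examples, check your status index mapping):

```java
import java.io.IOException;

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.core.CountRequest;
import org.elasticsearch.index.query.QueryBuilders;

public class HostUrlCounter {

    private final RestHighLevelClient client;

    public HostUrlCounter(RestHighLevelClient client) {
        this.client = client;
    }

    /** Counts how many URLs are already known for a given host. */
    public long countKnownUrls(String host) throws IOException {
        CountRequest request = new CountRequest("status")
                .query(QueryBuilders.termQuery("host", host));
        return client.count(request, RequestOptions.DEFAULT).getCount();
    }
}
```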
Again, it depends on what you use as a back end. With Elasticsearch for instance, you can use Kibana to see the URLs per domain.
No. The configuration is read when the worker tasks are started. I know of some users who wrote a custom configuration implementation backed by a DB table and got their components to read from it, but this meant modifying a lot of code.
Not on a per-domain basis, but you could add an intermediate bolt to check whether a domain should be processed or not; if not, you could simply fail the tuple instead of acking it. This depends on the status storage again. You could also add a custom filter to the ES spouts, for instance, together with a field in the status index: whenever the crawl should be halted for a specific domain, you would modify the value of that field for all the URLs matching that domain.
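As a sketch (the tuple fields and the way the halted domains are obtained are assumptions), such a gate bolt could look like this:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

/**
 * Sketch of a gate bolt placed between the spout and the FetcherBolt.
 * URLs belonging to a halted domain are failed instead of being forwarded.
 */
public class DomainGateBolt extends BaseRichBolt {

    private OutputCollector collector;
    // In a real topology this would be refreshed from the configuration,
    // a database or the status index rather than being hard-coded.
    private Set<String> haltedHosts = Collections.emptySet();

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String url = input.getStringByField("url");
        try {
            String host = new URL(url).getHost();
            if (haltedHosts.contains(host)) {
                // Do not forward the tuple: failing it tells the spout that
                // the URL was not processed.
                collector.fail(input);
                return;
            }
        } catch (MalformedURLException e) {
            // let downstream components deal with malformed URLs
        }
        collector.emit(input, input.getValues());
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // assumes the spout emits the usual (url, metadata) tuple
        declarer.declare(new Fields("url", "metadata"));
    }
}
```

The exact effect of failing depends on the spout and status storage implementation, but typically the URL will simply be re-emitted at a later point, so the crawl for that domain stalls until the gate lets it through again.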
Yes, often.
StormCrawler is very modular so there are always several ways of doing things ;-)
I am pretty sure you could get the behavior you want with a single topology by modifying small non-core parts. If more essential parts of the code are needed (e.g. per-seed robots settings), then we'd probably want to add that to the code itself - your contributions would be very welcome.
Upvotes: 2
Reputation: 2222
You have very interesting questions. I think you can discover more in the code: https://github.com/DigitalPebble/storm-crawler, the official tutorial: http://stormcrawler.net/, and a presentation with some answers: http://2015.berlinbuzzwords.de/sites/2015.berlinbuzzwords.de/files/media/documents/julien_nioche-low_latency_scalable_web_crawling_on_apache_storm.pdf
Upvotes: 0