Reputation: 7766
I have a relatively simple case. I basically want to store data about links between various websites, and I don't want to restrict which domains are crawled. I know I could write my own crawler using some HTTP client library, but I feel that I would be doing some unnecessary work -- making sure pages are not checked more than once, working out how to read and use a robots.txt file, maybe even trying to make it concurrent and distributed, and I'm sure a lot of other things that I haven't yet thought of.
So I want a framework for web crawling that takes care of these kinds of things, while allowing me to dictate what to do with the responses (in my case, just extracting the links and storing them). Most crawlers seem to assume you're indexing web pages for search, and that's no good: I need something customizable.
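For illustration, the bookkeeping described above (not visiting a page twice, honouring robots.txt, and pulling links out of a response) can be sketched with the Python standard library alone. This is only a rough sketch of what a framework would handle, not a recommendation to hand-roll it; the names `LinkExtractor`, `extract_links`, `allowed_by_robots` and `Frontier` are made up for this example, and fetching, politeness delays, and concurrency are deliberately left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    """Return the absolute URLs linked from one HTML page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of an already-downloaded robots.txt."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


class Frontier:
    """Queue of URLs to visit that never accepts the same URL twice."""

    def __init__(self):
        self.seen = set()
        self.pending = []

    def add(self, url):
        """Queue the URL; return False if it was already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.pending.append(url)
        return True
```

Even this small sketch leaves open URL normalisation, per-host rate limiting, and retry handling, which is the sort of work a framework is expected to take over.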
I want to store the link information in a MongoDB database, so the framework needs to let me control how the links are stored. And although I've tagged the question as language-agnostic, this also means that I have to limit the choice to a framework in one of MongoDB's supported languages (Python, Ruby, Perl, PHP, Java and C++), which is still a very wide net. I prefer dynamic languages, but I'm open to any suggestions.
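As a rough idea of the storage side, one possible shape is a single document per (source, target) pair, written with an upsert so that re-crawling a page does not create duplicates. The field names `source` and `target` and the helpers `link_document` and `store_links` are assumptions made for this sketch; `collection` is expected to behave like a pymongo collection, of which only `update_one(filter, update, upsert=True)` is used.

```python
def link_document(source_url, target_url):
    """Build the document for one link (field names are illustrative)."""
    return {"source": source_url, "target": target_url}


def store_links(collection, source_url, targets):
    """Upsert one document per distinct target; return how many were written."""
    written = 0
    for target in sorted(set(targets)):
        doc = link_document(source_url, target)
        # Upsert: insert the link if it is new, otherwise leave it as it is.
        collection.update_one(doc, {"$set": doc}, upsert=True)
        written += 1
    return written


# With pymongo installed, the collection could come from something like
# (database and collection names here are hypothetical):
#   from pymongo import MongoClient
#   collection = MongoClient()["crawl"]["links"]
```

Keeping the storage step behind a small function like this means it can be called from whatever callback or pipeline hook the chosen framework provides.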
I have been able to find Scrapy (which looks neat) and JSpider (which seems good, but perhaps a bit too "heavy duty", judging by the 121-page user manual), but I wanted to see if there are other good options out there that I'm missing.
Upvotes: 2
Views: 1264
Reputation: 4864
StormCrawler was not around when this question was asked, but it would have fitted the bill perfectly. It is written in Java, is highly modular and scalable, and can be customised to do exactly what is described above.
Upvotes: 0
Reputation: 64761
I suppose you have already searched Stack Overflow yourself, as there are quite a few similar questions among those tagged web-crawler. Having used none of the following extensively, I will refrain from elaborating and just list a few I feel are worth reviewing for the task at hand:
Well, good luck with the review ;)
Upvotes: 6