Reputation: 80
I want to create a focused crawler using Nutch. Is there any way to modify Nutch to make crawling faster? Can we use the metadata in Nutch to train a classifier that would reduce the number of URLs Nutch has to crawl for a given topic?
Upvotes: 1
Views: 479
Reputation: 1869
If the extracted URLs can be told apart by a regular expression, you can do that with Nutch as it is by adding a specific regex URL filter. But if you want to classify URLs according to metadata features of the page, you have to implement a custom HtmlParseFilter that filters the Outlink[] during the parse step. For more information about how to develop a plugin for Nutch, follow these links:
http://wiki.apache.org/nutch/AboutPlugins
http://wiki.apache.org/nutch/WritingPluginExample
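
To make the two stages concrete, here is a minimal, dependency-free Java sketch of the filtering logic only. It is not Nutch code: the `Link` class, `OutlinkFilterSketch`, the skip pattern, and the keyword list are all hypothetical stand-ins I made up for illustration. In a real plugin the same decision would live inside your parse filter and operate on Nutch's own Outlink objects, and the keyword check would be replaced by whatever classifier you train.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from the Nutch API.
public class OutlinkFilterSketch {

    // Simplified stand-in for an outlink: target URL plus anchor text.
    static final class Link {
        final String url;
        final String anchor;

        Link(String url, String anchor) {
            this.url = url;
            this.anchor = anchor;
        }
    }

    // Stage 1: a cheap URL regex, similar in spirit to a regex URL filter rule.
    // The extension list is an assumption chosen for the example.
    private static final Pattern SKIP =
            Pattern.compile(".*\\.(gif|jpg|png|css|js|zip)$", Pattern.CASE_INSENSITIVE);

    // Stage 2: a keyword list standing in for a trained topic classifier.
    private static final List<String> TOPIC_TERMS =
            Arrays.asList("crawler", "nutch", "index");

    // Returns true if the URL or anchor text mentions any topic term.
    static boolean isOnTopic(Link link) {
        String text = (link.url + " " + link.anchor).toLowerCase(Locale.ROOT);
        for (String term : TOPIC_TERMS) {
            if (text.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the links that pass the regex stage and the topic stage.
    static List<Link> filter(List<Link> outlinks) {
        List<Link> kept = new ArrayList<>();
        for (Link link : outlinks) {
            if (SKIP.matcher(link.url).matches()) {
                continue; // rejected by URL pattern alone
            }
            if (isOnTopic(link)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Link> outlinks = Arrays.asList(
                new Link("http://example.com/nutch-tutorial.html", "Nutch tutorial"),
                new Link("http://example.com/logo.png", "nutch logo"),
                new Link("http://example.com/recipes.html", "Cake recipes"));
        for (Link link : filter(outlinks)) {
            System.out.println(link.url);
        }
    }
}
```

The regex stage runs first because it is cheap and needs nothing but the URL string; the topic check only sees links that survive it. Dropping links at parse time means they never reach the crawl database, which is where the reduction in fetched URLs would come from.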
Upvotes: 2