Sabyasachi Behera

Reputation: 80

Focused crawler by modifying Nutch

I want to create a focused crawler using Nutch. Is there any way to modify Nutch to make crawling faster? Can we use the metadata in Nutch to train a classifier that would reduce the number of URLs Nutch has to crawl for a given topic?

Upvotes: 1

Views: 479

Answers (1)

Ali

Reputation: 1869

If the URLs you want can be distinguished by a regular expression, you can do this with stock Nutch by adding a suitable regex URL filter. If instead you want to classify URLs according to metadata features of the page, you will have to implement a custom HtmlParseFilter that filters the Outlink[] array during the parse step. For more information on how to develop a plugin for Nutch, see these links:

http://wiki.apache.org/nutch/AboutPlugins

http://wiki.apache.org/nutch/WritingPluginExample
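
To give a rough idea of the filtering logic such a parse filter could apply, here is a minimal sketch in plain Java with no Nutch dependency. Everything in it is an assumption for illustration: `FocusedOutlinkFilter`, `Link`, `score` and `keepRelevant` are hypothetical names, `Link` only stands in for a Nutch outlink (target URL plus anchor text), and the keyword-overlap score is a placeholder for whatever classifier you actually train. Wiring this into a real plugin is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FocusedOutlinkFilter {

    /** Hypothetical stand-in for a Nutch outlink: target URL plus anchor text. */
    static final class Link {
        final String url;
        final String anchor;

        Link(String url, String anchor) {
            this.url = url;
            this.anchor = anchor;
        }
    }

    private final Set<String> topicTerms;
    private final double threshold;

    FocusedOutlinkFilter(Set<String> topicTerms, double threshold) {
        this.topicTerms = topicTerms;
        this.threshold = threshold;
    }

    /**
     * Placeholder relevance score: the fraction of tokens in the URL and
     * anchor text that are topic terms. A trained classifier would go here.
     */
    double score(Link link) {
        String text = (link.url + " " + link.anchor).toLowerCase(Locale.ROOT);
        int total = 0;
        int hits = 0;
        for (String token : text.split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (topicTerms.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    /** Keeps only the outlinks whose score reaches the threshold. */
    List<Link> keepRelevant(List<Link> outlinks) {
        List<Link> kept = new ArrayList<>();
        for (Link link : outlinks) {
            if (score(link) >= threshold) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("crawler", "nutch", "search"));
        FocusedOutlinkFilter filter = new FocusedOutlinkFilter(terms, 0.2);
        List<Link> outlinks = Arrays.asList(
                new Link("http://example.com/nutch/crawler-tips", "Nutch crawler tips"),
                new Link("http://example.com/recipes/pasta", "Pasta recipes"));
        // Print the URLs that survive the relevance filter.
        for (Link link : filter.keepRelevant(outlinks)) {
            System.out.println(link.url);
        }
    }
}
```

In a real plugin the same idea would run inside the parse filter: score each outlink of the parsed page and keep only the ones that pass, so off-topic URLs are dropped before they can be scheduled for fetching.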

Upvotes: 2
