Ivo Kurtanovic
Ivo Kurtanovic

Reputation: 41

Apache NUTCH, relevant crawling

I am crawling websites using Apache NUTCH 2.2.1, which provides me content to index on SOLR. When NUTCH fetches content, there are contextual information such as "contact us","legal notice" or some other irrelevant information (generally coming from upper menu, left menu or from footer of the page) that I do not need to index.

One of the solution would be to automatically select the most relevant part of the content to index, which can be done by an automatic summarizer. There is a plugin "summary-basic", is it used for this purpose? If so how is it configured? Other solutions are also welcome.

Upvotes: 0

Views: 183

Answers (1)

gbharat
gbharat

Reputation: 276

In regex-urlfilter.txt you can specify the list of urls you want to ignore. You can specify the http link for "contact us"(typically all header, footer information that you don't want to crawl) etc.. in that regex list. While crawling web, nutch will ignore those urls and will only fetch the requires content. You can find regex-urlfilter.txt under apache-nutch-2.2.1/conf folder

Upvotes: 0

Related Questions