Reputation: 275
I'm using Nutch 1.6 and Solr 4.3 on Ubuntu Server 12.04. I would like to switch content indexing on and off for parts of a page. Is there a way to specify this in my HTML pages so that Solr behaves accordingly?
As an example, with Google Search Appliance I would put "googleoff" / "googleon" tags around the content on the page that I don't want indexed (headers, footers, copyright strings, etc.).
Thank you.
Upvotes: 0
Views: 248
Reputation: 22555
You will need to create a custom plugin for Nutch to get this behavior. Below are some relevant links with examples.
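To give an idea of what such a plugin would have to do, here is a minimal sketch of just the stripping step, written as plain Java with no Nutch dependencies. It assumes you mark excluded regions with GSA-style comments (`<!--googleoff: index-->` ... `<!--googleon: index-->`); Nutch and Solr do not recognize these markers on their own, and the class name `IndexMarkerStripper` is made up for illustration. How you call this from your plugin (for example from one of Nutch's parse-filter extension points) is not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes regions wrapped in GSA-style marker comments.
// The markers are an assumption borrowed from the question, not a Nutch/Solr feature.
final class IndexMarkerStripper {

    // Matches from a googleoff comment to the next googleon comment (non-greedy),
    // across line breaks and regardless of case or extra whitespace in the comment.
    private static final Pattern EXCLUDED = Pattern.compile(
        "<!--\\s*googleoff:\\s*index\\s*-->.*?<!--\\s*googleon:\\s*index\\s*-->",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the HTML with every marked region removed.
    // An unclosed googleoff marker is left untouched in this sketch.
    static String strip(String html) {
        return EXCLUDED.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>Keep me</p>"
            + "<!--googleoff: index--><footer>Copyright</footer><!--googleon: index-->"
            + "<p>And me</p>";
        System.out.println(strip(page)); // prints <p>Keep me</p><p>And me</p>
    }
}
```

A regex is enough here because the markers are comments rather than nested elements. If you prefer to work on the parsed DOM inside the plugin, you would walk the nodes and skip everything between the two comment nodes instead.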
Upvotes: 3
Reputation: 1
There is a text file, "robots.txt", that tells search engines which HTML pages their crawlers are or are not allowed to fetch for content. In the linked FAQ, "robots.txt: How to stop indexing", you will find all the information.
Upvotes: 0