Reputation: 11
During solrindex, how can I tell Nutch to skip indexing documents that have an empty content field?
I found http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/, but the index-omit plugin described there only lets Nutch filter out documents that lack certain metatag fields, not general fields such as content.
Upvotes: 0
Views: 767
Reputation: 1708
You might need to implement a new Nutch filter that discards a document when its content is empty.
You can find more information on how to write a plugin at this link: https://wiki.apache.org/nutch/AboutPlugins
EDIT:
I wrote a simple plugin as an example.
It looks at the "content" field and, if that field is empty, ignores the document so that it is not indexed.
You can get it from here: https://github.com/nimeshjm/index-discardemptycontent
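For readers who just want the gist of the check, here is a minimal, self-contained sketch of the logic. It is not the code from the repository above and it does not use the Nutch API: the class name DiscardEmptyContent and the Map standing in for a Nutch document are hypothetical, chosen so the example runs without Nutch on the classpath. As far as I know, in Nutch 1.x the same test would live in the filter(...) method of an IndexingFilter implementation, where returning null tells Nutch to drop the document.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the plugin logic; a Map plays the role of the
// document that a real Nutch indexing filter would receive.
public class DiscardEmptyContent {

    // Returns the document unchanged, or null to signal "do not index this".
    public static Map<String, String> filter(Map<String, String> doc) {
        String content = doc.get("content");
        // Treat a missing field and a whitespace-only field as empty.
        if (content == null || content.trim().isEmpty()) {
            return null;
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> blank = new HashMap<>();
        blank.put("url", "http://example.com/a");
        blank.put("content", "   ");

        Map<String, String> page = new HashMap<>();
        page.put("url", "http://example.com/b");
        page.put("content", "Some page text");

        System.out.println("blank kept: " + (filter(blank) != null));
        System.out.println("page kept: " + (filter(page) != null));
    }
}
```

In a real plugin the filter would also have to be enabled through plugin.includes, and if it reads the "content" field from the document it should run after the filter that populates that field.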
Upvotes: 2