codemonkey

Reputation: 11

How to skip documents with empty content field during Nutch to Solr indexing?

During solrindex, how can I tell Nutch to skip documents whose content field is empty?

I found http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/, but the index-omit plugin described there only filters documents based on certain metatag fields, not general fields such as content.

Upvotes: 0

Views: 767

Answers (1)

nimeshjm

Reputation: 1708

You will probably need to implement a new Nutch indexing filter that discards the document when its content is empty.

For details on writing a plugin, see: https://wiki.apache.org/nutch/AboutPlugins

EDIT:
I wrote a simple plugin as an example. It checks the "content" field and, if it is empty, discards the document so that it is not indexed.

You can get it from here: https://github.com/nimeshjm/index-discardemptycontent
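
For illustration only, here is a rough sketch of the check such a filter performs. It is not the code from the repository above, and it does not use the Nutch API: the `Map` is a hypothetical stand-in for a Nutch document, and `EmptyContentFilterSketch` is a made-up name. As far as I know, in Nutch 1.x an indexing filter can return `null` from its `filter` method to drop a document, and the sketch mimics that convention.

```java
import java.util.HashMap;
import java.util.Map;

public class EmptyContentFilterSketch {

    // Hypothetical stand-in for a Nutch document: field name -> field value.
    // Returns the document unchanged, or null to signal "do not index this".
    static Map<String, String> filter(Map<String, String> doc) {
        String content = doc.get("content");
        // Treat a missing, empty, or whitespace-only content field as empty.
        if (content == null || content.trim().isEmpty()) {
            return null;
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> emptyDoc = new HashMap<>();
        emptyDoc.put("url", "http://example.com/empty");
        emptyDoc.put("content", "   ");

        Map<String, String> fullDoc = new HashMap<>();
        fullDoc.put("url", "http://example.com/full");
        fullDoc.put("content", "Some page text");

        System.out.println(filter(emptyDoc) == null ? "skipped" : "indexed");
        System.out.println(filter(fullDoc) == null ? "skipped" : "indexed");
    }
}
```

In a real plugin the same check would live inside a class implementing Nutch's indexing filter extension point, and the plugin would have to be enabled in the Nutch configuration before solrindex picks it up.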

Upvotes: 2
