viren

Reputation: 1103

Solr index: removing HTML tags and garbage content from indexing

I am working with Solr 6.5, and I noticed that my index keeps growing as I add content. I use a stop word file, so common words are not indexed.

I see many HTML tags in the index that I do not want indexed, as well as comments in the content that should not be indexed either. How can I find these and update my stop word file to handle them?

I have indexed English content only, and the index is already 30 GB with only 9 million documents.

Upvotes: 1

Views: 845

Answers (1)

MatsLindh

Reputation: 52852

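You can use HTMLStripCharFilterFactory in the field's analyzer to strip HTML markup from the text before it is tokenized at index time. That is a better fit than a stop word list, which only drops individual tokens after tokenization.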
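As a rough sketch of what that could look like in the schema, assuming the HTML ends up in a field called `content` and the stop word file is named `stopwords.txt` (both names, and the type name `text_html_stripped`, are placeholders rather than anything taken from the question):

```xml
<!-- Hypothetical field type; the name is illustrative only -->
<fieldType name="text_html_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer, so markup is removed
         before any tokens are produced -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumes the existing stop word file is called stopwords.txt -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

<!-- Assumed field name; point your own content field at the new type -->
<field name="content" type="text_html_stripped" indexed="true" stored="true"/>
```

The char filter only changes what gets analyzed and indexed. If the field is stored, the stored value still contains the original HTML, and existing documents have to be reindexed before the change shows up in the index.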

But 30 GB for 9 million documents is only about 3.3 KB per document, which isn't really that much. The documents have an inherent size, so the index will keep growing for as long as you keep adding them.

Upvotes: 2
