Solr index file: removing HTML tags and garbage content from indexing

Question

I am working with Solr 6.5, and I have noticed that my index file size keeps increasing as content is added. I use a stopword file, so common words are not indexed.

I see many HTML tags in the index that I do not want indexed, as well as comments in the content that should not be indexed. How can I find these and update my stopword file to handle them?

I have indexed English content only, and the index file is already 30 GB with only 9 million documents.


Answers (1)
