Reputation: 1103
I am working on Solr 6.5, and one thing I noticed is that my index file size keeps on increasing with content. I have used a stop word file and no common words are indexed.
I see many HTML tags in the index, which I do not want to index, as well as comments in content which should not be indexed. How can I find these and update my stopword txt to handle them?
I have indexed English content only, and the index is already 30 GB with only 9 million documents.
Upvotes: 1
Views: 845
Reputation: 52852
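You can use an HTMLStripCharFilterFactory in the field's analyzer to strip HTML markup from the content before it is tokenized and indexed, rather than trying to catch tags with stopwords.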
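As a rough sketch of what that could look like in the schema (the field type name `text_html_stripped` and the `stopwords.txt` file name are placeholders I made up; adjust the tokenizer and filters to match your existing field type):

```xml
<!-- Hypothetical field type: the name and the filter chain are assumptions, not your actual schema -->
<fieldType name="text_html_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer, so tags are removed before any tokens are produced -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumes your existing stopword file is named stopwords.txt -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

Keep in mind that the char filter only affects the indexed terms. If the field is stored, the stored value will still contain the original HTML, and you need to reindex for the change to take effect on existing documents.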
But 30 GB for 9 million documents is only about 3.3 KB per document, which isn't really that much. These documents do have an inherent size, so they will add data to the index as long as you're indexing them.
Upvotes: 2