Reputation: 19664
I have an Elasticsearch index loaded with documents. If I delete all documents in that index but keep the index itself, does it keep the tokens used in the TF-IDF scoring on that field? I.e., if I load new documents, are they tokenized and analyzed against the old contents of the index, or are the results entirely new, as if the old documents had never existed? Is there any memory in the scoring data after deleting all documents?
Upvotes: 0
Views: 308
Reputation: 27487
There is some memory in the scoring process after you have deleted documents in Elasticsearch. Specifically, TF-IDF scoring uses the maxDoc value of a shard (scoring is done per shard, not per index). However, maxDoc is not updated when documents are deleted, so deleted documents can still influence scoring. From a previous discussion on GitHub:
well deleted documents still contribute to the score calculation since they are only marked as deleted but statistics are not updated so yes they contribute to the score.
https://github.com/elasticsearch/elasticsearch/issues/3578
As for the data itself, it is still in the Lucene index after a deletion. It is just marked as deleted and is no longer used or returned. The actual removal of the data occurs when Lucene segment files are merged.
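If you want the deleted data physically removed rather than waiting for background merges, one option is to request a merge that only expunges deletes. The following is a minimal sketch, not a definitive recipe: it assumes a node reachable at http://localhost:9200, a hypothetical index name my_index, and a version of Elasticsearch that exposes the _forcemerge endpoint (older releases used _optimize instead). The helper names are illustrative only.

```python
import urllib.parse
import urllib.request


def force_merge_url(base_url: str, index: str) -> str:
    """Build the URL for a force merge that only expunges deleted documents."""
    index_path = urllib.parse.quote(index, safe="")
    return f"{base_url.rstrip('/')}/{index_path}/_forcemerge?only_expunge_deletes=true"


def expunge_deletes(base_url: str, index: str) -> int:
    """Send the POST request and return the HTTP status code.

    Assumes an unauthenticated cluster; add auth headers as needed.
    """
    request = urllib.request.Request(
        force_merge_url(base_url, index), data=b"", method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Only the URL is built here; call expunge_deletes(...) against a live cluster.
print(force_merge_url("http://localhost:9200", "my_index"))
```

Force merging is I/O heavy, so it is usually better run during quiet periods than on every bulk delete.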
In practice, this has no impact other than the lingering effect on scoring described above. New documents are tokenized and analyzed without the deleted documents having any influence. So while there is some memory in the scoring process, it is usually not considered a big issue.
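To get a feel for the size of the effect, here is a rough sketch of the inverse document frequency term. It assumes the classic Lucene TF-IDF form, idf = 1 + ln(maxDoc / (docFreq + 1)); the exact formula depends on the Lucene version and the similarity configured, and the document counts below are made up for illustration. It also assumes none of the deleted documents contained the term, so only the document count is stale.

```python
import math


def classic_idf(doc_freq: int, max_doc: int) -> float:
    """Classic Lucene-style IDF (assumed form): 1 + ln(max_doc / (doc_freq + 1))."""
    return 1.0 + math.log(max_doc / (doc_freq + 1))


# Hypothetical shard: 900 old documents deleted but not yet merged away,
# 100 new documents indexed, and the term appears in 5 of the new ones.
stale_idf = classic_idf(doc_freq=5, max_doc=1000)  # deleted docs still counted
fresh_idf = classic_idf(doc_freq=5, max_doc=100)   # after segments are merged

print(f"idf with stale count: {stale_idf:.3f}")
print(f"idf after merge:      {fresh_idf:.3f}")
```

Because every term on the shard is scaled by the same inflated document count, relative ranking within a single shard tends to shift little, which fits the point that this is rarely a practical problem.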
Upvotes: 1