Reputation: 1958
I have a field that store array of strings. different documents hold different set of strings.
ex: "ftypes": ["PDF", "TXT", "XML"]
now I used this aggregation query to analyze each file type usage.
{
"aggs": {
"list": {
"terms": {
"field": "ftypes",
"min_doc_count": 0,
"size": 100000
}
}
}
}
result ==>
{
"took": 11,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 137265,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"list": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "PDF",
"doc_count": 134475
},
{
"key": "TXT",
"doc_count": 21312
},
{
"key": "XML",
"doc_count": 6597
},
{
"key": "JPG",
"doc_count": 1233
}
]
}
}
}
and the results were correct as expected. but recently I've updated this field after removing XML file support. so non of the doc has file type XML. i can confirm that from this query.
{
"query": {
"terms": {
"ftypes": ["XML"]
}
}
}
result ===>
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
total hits count is zero. strange thing is when I do the above aggregation query again yet I can see XML as a term. doc count is zero.
{
"took": 11,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 137265,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"list": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "PDF",
"doc_count": 134475
},
{
"key": "TXT",
"doc_count": 21312
},
{
"key": "JPG",
"doc_count": 1233
},
{
"key": "XML",
"doc_count": 0
}
]
}
}
}
where is this XML term is now coming from if it does not exists on any document?. is there are any cache that i need to remove?
Upvotes: 1
Views: 924
Reputation: 22974
I suggest you to refer this
ES uses Lucene under the hood. They are called ghost terms
. Here, XML
is a ghost term in the index.
Aggregate term statistics, used for query scoring, will still reflect deleted terms and documents. When a merge completes, the term statistics will suddenly jump closer to their true values, changing hit scores. In practice this impact is minor, unless the deleted documents had divergent statistics from the rest of the index.
All subsequent searches simply skip any deleted documents. It is not until segments are merged that the bytes consumed by deleted documents are reclaimed. Likewise, any terms that occur only in deleted documents (ghost terms) are not removed until merge.
The link has enough reasons for this process.
To avoid that term from the output, you need to set min_doc_count:1
, it will fetch docs with at least one document
Upvotes: 2