eran

Reputation: 15136

How does doc_values work in ElasticSearch

Can someone explain how doc_values work? Why would they help me when doing aggregations?

Would it help me when filtering?

For filtering, the way I see it, ElasticSearch would access the inverted index to find "pointers" to all the documents that match the filter, so doc_values, which the documentation describes as an "uninverted index", would be irrelevant. Or am I wrong?

Can someone explain the flow of an aggregation when doc_values is enabled, and when it isn't, and why enabling it saves memory?

Thanks.

Upvotes: 1

Views: 2882

Answers (1)

Andrei Stefan

Reputation: 52368

General statements about doc_values:

  • doc_values help with heap memory usage
  • they replace the in-heap data structure called fielddata
  • fielddata is used when sorting, running aggregations, running scripts that access field values, using parent-child relationships, and applying geo-distance filters

Until doc_values came into play, fielddata was loaded into the heap. doc_values do not use the heap but the memory outside it, the file system cache, because doc_values live on disk. Lucene reads the files, the operating system caches them in the file system cache, and later requests are served from there.
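As a rough sketch of what turning this on looks like, the snippet below builds an index mapping that sets the `doc_values` parameter on a field. The field name `status` and the overall body are made up for illustration, and the `keyword` type assumes a recent version (older 1.x releases used `string` with `not_analyzed` instead). Recent versions already enable doc_values by default for most non-text field types, so the explicit flag is only there to make the setting visible.

```python
import json

# Hypothetical mapping for an index with one field, "status".
# "doc_values" is the mapping parameter; everything else here
# (field name, keyword type) is an assumption for the example.
mapping_body = {
    "mappings": {
        "properties": {
            "status": {
                "type": "keyword",
                "doc_values": True,  # store the column on disk, not in heap fielddata
            }
        }
    }
}

# Serialize to the JSON body that would be sent when creating the index.
body_json = json.dumps(mapping_body, indent=2)
print(body_json)
```

The printed JSON would be the request body of an index-creation call; sending it is left out so the sketch stays independent of any client library.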

Why this is important: the heap has a limited size, and the recommendation is not to give it more than about 30 GB. The heap also holds other things: filter caches, query caches, indexing buffers, metadata from the segment files, etc. Fielddata usually takes a lot of room, not because it is inefficient, but because ES needs to load the field's values for all documents into memory so that it can sort or aggregate on them. For larger indices (and, implicitly, shards) this means a lot of data.

That's why doc_values were introduced: to move this burden from the heap (which is limited) to the OS file system cache (which is limited as well, but definitely under less pressure).

doc_values will not help you with aggregations per se. doc_values are fielddata kept on disk instead of in the heap. Fielddata is mandatory for aggregations, so what doc_values give you is lower heap memory usage.
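To make the filtering versus aggregation distinction concrete, here is a toy model in Python. It is only an illustration of the two access patterns, not how Lucene actually stores anything: the inverted index maps a term to document ids (what a filter needs), while the "uninverted" column maps a document id to its value (what an aggregation or sort needs, whether that column lives in heap fielddata or in on-disk doc_values). All names and data are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical documents: doc id -> value of a single "color" field.
docs = {0: "red", 1: "blue", 2: "red", 3: "green"}

def build_inverted(docs):
    """term -> list of doc ids; answers 'which docs contain this term?'"""
    inverted = defaultdict(list)
    for doc_id, term in docs.items():
        inverted[term].append(doc_id)
    return dict(inverted)

def build_column(docs):
    """doc id -> value, stored as a list indexed by doc id (the uninverted view)."""
    return [docs[doc_id] for doc_id in sorted(docs)]

def filter_docs(inverted, term):
    """Filtering only needs the inverted index."""
    return inverted.get(term, [])

def terms_agg(column, doc_ids):
    """Aggregating needs each matching doc's value, so it reads the column."""
    return Counter(column[doc_id] for doc_id in doc_ids)

inverted = build_inverted(docs)
column = build_column(docs)

# Filter for two terms, then count values over the matching docs.
matching = filter_docs(inverted, "red") + filter_docs(inverted, "blue")
counts = terms_agg(column, matching)
print(matching, dict(counts))
```

In this model the filter never touches `column`, which matches the intuition in the question: doc_values matter for the aggregation step, and the only thing that changes when they are enabled is where that column is kept.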

Upvotes: 6
