Reputation: 152
I need to get the size of the documents of the query results.
Example:
this is a document. (19bytes).
this is also a document. (24bytes)
content:{"a":"this is a document", "b":"this is also a document"}(53bytes)
When I query for the documents in Elasticsearch, I will get the documents above as a result. The combined size of both documents is 32bytes, and I need Elasticsearch to return that 32bytes along with the result.
Upvotes: 5
Views: 15912
Reputation: 322
Elasticsearch now has a _size field (provided by the mapper-size plugin), which can be enabled in the mappings.
Once enabled, it stores the size of the _source field in bytes.
GET <index_name>/_doc/<doc_id>?stored_fields=_size
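For reference, enabling it looks roughly like this (this assumes the mapper-size plugin is installed; the index name is a placeholder):

```json
PUT my-index
{
  "mappings": {
    "_size": {
      "enabled": true
    }
  }
}
```

Note that _size is computed at index time, so it is only available for documents indexed after the field was enabled.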
Upvotes: 3
Reputation: 10859
Does your document only contain a single field? I'm not sure this is 100% of what you want, but generally you can calculate the length of fields and either store them with the document or calculate them at query time (but this is a slow operation and I would avoid it if possible).
So here's an example with a test document and the calculation for the field length:
PUT test/_doc/1
{
  "content": "this is a document."
}

POST test/_update_by_query
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "content_length"
          }
        }
      ]
    }
  },
  "script": {
    "source": """
      if (ctx._source.containsKey("content")) {
        ctx._source.content_length = ctx._source.content.length();
      } else {
        ctx._source.content_length = 0;
      }
    """
  }
}
GET test/_search
The query result is then:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "content" : "this is a document.",
          "content_length" : 19
        }
      }
    ]
  }
}
By the way, the value is 19 because the script counts all characters, including spaces and the dot. If you want to exclude those, you'll have to add some more logic to the script. Also be careful with bytes: UTF-8 may use more than one byte per character (as in höhe), and this script really only counts characters.
Then you can easily use the length in queries and aggregations.
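For example, a range query combined with an average aggregation on the content_length field created above might look like this (a sketch; the thresholds are illustrative):

```json
GET test/_search
{
  "query": {
    "range": {
      "content_length": {
        "gte": 10
      }
    }
  },
  "aggs": {
    "avg_length": {
      "avg": {
        "field": "content_length"
      }
    }
  }
}
```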
If you want to calculate the size of all the subdocuments combined, use the following:
PUT test/_doc/2
{
  "content": {
    "a": "this is a document",
    "b": "this is also a document"
  }
}

POST test/_update_by_query
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "content_length"
          }
        }
      ]
    }
  },
  "script": {
    "source": """
      if (ctx._source.containsKey("content")) {
        ctx._source.content_length = 0;
        for (item in ctx._source.content.entrySet()) {
          ctx._source.content_length += item.getValue().length();
        }
      }
    """
  }
}
GET test/_search
Just note that content can either be of type text or contain a subdocument, but you can't mix the two in the same index.
Upvotes: 4
Reputation: 1855
There's no way to get the size of an Elasticsearch document through the API. The reason is that a document indexed into Elasticsearch takes up a different amount of space in the index depending on whether you store _all, which fields are indexed, the mapping types of those fields, doc_values, and more. Elasticsearch also uses deduplication and other compaction methods, so the index size has no linear correlation with the original documents it contains.
One way to work around this is to calculate the document size in advance, before indexing it, and add it as another field in the doc, e.g. a doc_size field. You can then query this calculated field and run aggregations on it.
Note, however, that as stated above this does not represent the size of the document in the index and might be completely wrong: for example, if all the docs contain a very long text field with the same value, Elasticsearch would only store that long value once and reference it, so the index size would be much smaller.
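As a sketch of the doc_size workaround described above (doc_size is a hypothetical field name, and the value 18 is just the client-side byte count of this example string; both the index name and the numbers are illustrative):

```json
PUT my-index/_doc/1
{
  "content": "this is a document",
  "doc_size": 18
}

GET my-index/_search
{
  "aggs": {
    "total_size": {
      "sum": {
        "field": "doc_size"
      }
    }
  }
}
```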
Upvotes: 0