Brent Hronik

Reputation: 2427

Retrieve largest document size in ElasticSearch

Is it possible to retrieve the largest document (or just its size) in ElasticSearch with a single query?

The motivation is to cache returned documents in a MySQL store, so I would like to know the order of magnitude of the largest documents in order to choose between TEXT, MEDIUMTEXT and LONGTEXT.
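For reference when making that choice, MySQL caps TEXT at 65,535 bytes, MEDIUMTEXT at 16,777,215 bytes and LONGTEXT at 4,294,967,295 bytes. A minimal Python sketch of the decision, assuming the size you measure is the byte length of the serialized document (`pick_text_type` and `TEXT_LIMITS` are illustrative names, not part of any library):

```python
# MySQL TEXT-family capacity limits, in bytes (not characters).
TEXT_LIMITS = [
    ("TEXT", 2**16 - 1),        # 65,535
    ("MEDIUMTEXT", 2**24 - 1),  # 16,777,215
    ("LONGTEXT", 2**32 - 1),    # 4,294,967,295
]


def pick_text_type(max_doc_bytes):
    """Return the smallest TEXT type that can hold max_doc_bytes bytes."""
    for name, limit in TEXT_LIMITS:
        if max_doc_bytes <= limit:
            return name
    raise ValueError("document exceeds LONGTEXT capacity")
```

Because the limits are in bytes, multi-byte UTF-8 characters count more than once, so it is safer to leave some headroom over the largest document observed.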

EDIT: This is on ES 1.3.

Upvotes: 1

Views: 2684

Answers (2)

Vidal Gonzalez

Reputation: 61

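My quick, rough approach was to create a temporary index via _reindex, adding a field that holds the length of each document's string representation (this relies on the _reindex API and Painless scripting, so it needs a newer Elasticsearch than 1.3):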

POST _reindex
{
  "source": {
    "index": "input_index"
  },
  "dest": {
    "index": "docs_size_index"
  },
  "script": {
    "source": """
      HashMap st = ctx._source;
      if (st != null){
        ctx._source['docsize'] = st.toString().length();
      } else { 
        ctx._source['docsize'] = 0;
      }
    """
  }
}

Then query the temporary index, sorting on that field in descending order:

GET docs_size_index/_search
{
  "_source": {
    "includes": ["docsize"]
  },
  "sort": [
    {
      "docsize": {
        "order": "desc"
      }
    }
  ]
}

The first hit is the largest document in the index by this measure. You can then retrieve it by ID from the original index and check its actual size:

curl -XGET "http://localhost:9700/modules/_doc/<DOC_ID>" | json_pp > biggest_doc.json
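Note that `docsize` above is the character count of the map's `toString()` output rather than of the JSON, so it is best treated as a ranking heuristic. A small Python sketch for measuring a retrieved document in UTF-8 bytes, assuming the response is the usual GET-document envelope with the document under `_source` (`source_size_bytes` is a hypothetical helper name):

```python
import json


def source_size_bytes(response):
    """UTF-8 byte length of the compactly serialized document.

    Uses the "_source" key when present; otherwise measures the
    whole object that was passed in.
    """
    source = response.get("_source", response)
    # Compact separators and raw (non-escaped) Unicode approximate what
    # would be stored as text; the exact bytes depend on your serializer.
    compact = json.dumps(source, separators=(",", ":"), ensure_ascii=False)
    return len(compact.encode("utf-8"))
```

For the file saved above, something like `source_size_bytes(json.load(open("biggest_doc.json")))` should give the number to compare against the MySQL column limits.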

Upvotes: 0

Roman

Reputation: 2118

To the best of my knowledge, there's no such possibility out of the box.

You could, however, try a scripted aggregation, where the value being aggregated is the sum of the lengths of all fields (or of just the fields you care about).
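As a rough sketch of what such a request body might look like, the Python below builds a `max` aggregation over a script that adds up field lengths. It assumes the ES 1.x style of passing the script as a plain string; newer versions expect a script object instead, and inline scripts may need to be enabled in the cluster. The helper name `max_length_agg_body` and the aggregation name `max_doc_size` are made up for illustration.

```python
import json


def max_length_agg_body(fields):
    """Build a search body whose max aggregation sums the lengths of `fields`."""
    if not fields:
        raise ValueError("at least one field is required")
    # One term per field, e.g. doc['field1'].value.size()
    script = " + ".join("doc['%s'].value.size()" % f for f in fields)
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {"max_doc_size": {"max": {"script": script}}},
    }


print(json.dumps(max_length_agg_body(["field1", "field2"]), indent=2))
```

The script reads indexed field values, so for analyzed text fields the result may reflect individual terms rather than the original string; treat the number as an estimate.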

Another option is to sort the documents with a script, for example:

"sort": {
    "_script": {
        "script": "doc['field1'].value.size() + doc['field2'].value.size()",
        "type": "number",
        "order": "desc"
    }
}

Upvotes: 1
