How to make an elasticsearch query that filters on the maximum value of a field?

Question

I would like to be able to query for text but also retrieve only the results with the maximum value of a certain integer field in my data. I have read the docs about aggregations and filters and I don't quite see what I am looking for.

For instance, I have some repeating data that gets indexed that is the same except for an integer field - let's call this field lastseen.

So, as an example, given this data put into elasticsearch:

  # these two are the same except for the "lastseen" field
  curl -XPOST localhost:9200/myindex/myobject -d '{
    "field1": "dinner carrot potato broccoli",
    "field2": "something here",
    "lastseen": 1000
  }'

  curl -XPOST localhost:9200/myindex/myobject -d '{
    "field1": "dinner carrot potato broccoli",
    "field2": "something here",
    "lastseen": 100
  }'

  # and these two are the same except for the "lastseen" field
  curl -XPOST localhost:9200/myindex/myobject -d '{
    "field1": "fish chicken something",
    "field2": "dinner",
    "lastseen": 2000
  }'

  curl -XPOST localhost:9200/myindex/myobject -d '{
    "field1": "fish chicken something",
    "field2": "dinner",
    "lastseen": 200
  }'

If I query for "dinner"

  curl -XPOST localhost:9200/myindex/_search -d '{
    "query": {
      "query_string": {
        "query": "dinner"
      }
    }
  }'

I'll get 4 results back. I'd like to have a filter such that I only get two results back - only the items with the maximum lastseen field.

This is obviously not right, but hopefully it gives you an idea of what I am after:

{
    "query": {
        "query_string": {
            "query": "dinner"
        }
    },
    "filter": {
        "max": "lastseen"
    }
}

The results would look something like:

"hits": [
      {
        ...
        "_source": {
          "field1": "dinner carrot potato broccoli",
          "field2": "something here",
          "lastseen": 1000
        }
      },
      {
        ...
        "_source": {
          "field1": "fish chicken something",
          "field2": "dinner",
          "lastseen": 2000
        }
      } 
   ]

update 1: I tried creating a mapping that excluded lastseen from being indexed. This did not work. Still getting all 4 results back.

curl -XPOST localhost:9200/myindex -d '{  
    "mappings": {
      "myobject": {
        "properties": {
          "lastseen": {
            "type": "long",
            "store": "yes",
            "include_in_all": false
          }
        }
      }
    }
}'

update 2: I tried a deduplication with the agg scheme listed here, and it did not work, but more importantly, I don't see a way to combine that with a keyword search.

Andrei Stefan · Accepted Answer

Not ideal, but I think it gets you what you need.

Change the mapping of your field1 field, assuming this is the one that you use to define "duplicate" documents, like this:

PUT /lastseen
{
  "mappings": {
    "test": {
      "properties": {
        "field1": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "field2": {
          "type": "string"
        },
        "lastseen": {
          "type": "long"
        }
      }
    }
  }
}

That is, you add a .raw subfield that is not_analyzed, which means the value is indexed exactly as it is, without being analyzed and split into terms. This is what makes it possible to spot the "duplicate" documents.
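To see why the unanalyzed subfield matters, here is a rough Python sketch. It uses lowercasing plus a whitespace split as a crude stand-in for the analyzer (an assumption; the real analyzer does more) and compares the bucket keys a terms aggregation would see in each case.

```python
from collections import Counter

# The field1 values from the question's four sample documents.
field1_values = [
    "dinner carrot potato broccoli",
    "dinner carrot potato broccoli",
    "fish chicken something",
    "fish chicken something",
]

def analyzed_terms(text):
    # Crude stand-in for an analyzer (assumption): lowercase, split on whitespace.
    return set(text.lower().split())

# Bucketing on the analyzed field groups documents by individual words.
analyzed_buckets = Counter(t for v in field1_values for t in analyzed_terms(v))

# Bucketing on the not_analyzed subfield groups documents by the whole string.
raw_buckets = Counter(field1_values)

print(sorted(analyzed_buckets))  # one bucket per word
print(dict(raw_buckets))         # one bucket per distinct field1 value
```

Only the whole-string buckets line up with the notion of "duplicate" used here, which is why the aggregation targets field1.raw rather than field1.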

Then, you need to use a terms aggregation on field1.raw (for duplicates) and a top_hits sub-aggregation to get a single document for each field1 value:

GET /lastseen/test/_search
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "dinner"
    }
  },
  "aggs": {
    "field1_unique": {
      "terms": {
        "field": "field1.raw",
        "size": 2
      },
      "aggs": {
        "first_one": {
          "top_hits": {
            "size": 1,
            "sort": [{"lastseen": {"order":"desc"}}]
          }
        }
      }
    }
  }
}

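The single document returned by top_hits is the one with the highest lastseen, which is what "sort": [{"lastseen": {"order":"desc"}}] ensures. Note that "size": 2 on the terms aggregation caps the response at two buckets; raise it if your query can match more distinct field1 values.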
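As a mental model only, the query plus the two aggregations behave roughly like the following plain-Python sketch. It does not talk to Elasticsearch. The sample documents are assumed to all carry a lastseen field, and the matches helper is a hypothetical, simplified stand-in for the query_string match.

```python
docs = [
    {"field1": "dinner carrot potato broccoli", "field2": "something here", "lastseen": 1000},
    {"field1": "dinner carrot potato broccoli", "field2": "something here", "lastseen": 100},
    {"field1": "fish chicken something", "field2": "dinner", "lastseen": 2000},
    {"field1": "fish chicken something", "field2": "dinner", "lastseen": 200},
]

def matches(doc, word):
    # Simplified stand-in for query_string (assumption): the word occurs in a text field.
    return any(word in doc[field].split() for field in ("field1", "field2"))

def latest_per_group(docs, word):
    best = {}
    for doc in docs:
        if not matches(doc, word):
            continue
        key = doc["field1"]  # plays the role of the field1.raw bucket key
        # Keep only the document with the highest lastseen per bucket,
        # like top_hits with size 1 sorted by lastseen descending.
        if key not in best or doc["lastseen"] > best[key]["lastseen"]:
            best[key] = doc
    return list(best.values())

result = latest_per_group(docs, "dinner")
for doc in result:
    print(doc["field1"], doc["lastseen"])
```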

The results you get back look like this (under aggregations, not hits, since "size": 0 suppresses the regular hits):

   ...
   "aggregations": {
      "field1_unique": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "dinner carrot potato broccoli",
               "doc_count": 2,
               "first_one": {
                  "hits": {
                     "total": 2,
                     "max_score": null,
                     "hits": [
                        {
                           "_index": "lastseen",
                           "_type": "test",
                           "_id": "AU60ZObtjKWeJgeyudI-",
                           "_score": null,
                           "_source": {
                              "field1": "dinner carrot potato broccoli",
                              "field2": "something here",
                              "lastseen": 1000
                           },
                           "sort": [
                              1000
                           ]
                        }
                     ]
                  }
               }
            },
            {
               "key": "fish chicken something",
               "doc_count": 2,
               "first_one": {
                  "hits": {
                     "total": 2,
                     "max_score": null,
                     "hits": [
                        {
                           "_index": "lastseen",
                           "_type": "test",
                           "_id": "AU60ZObtjKWeJgeyudJA",
                           "_score": null,
                           "_source": {
                              "field1": "fish chicken something",
                              "field2": "dinner",
                              "lastseen": 2000
                           },
                           "sort": [
                              2000
                           ]
                        }
                     ]
                  }
               }
            }
         ]
      }
   }
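Since the documents arrive nested inside the buckets, client code usually has to flatten them. A minimal Python sketch, assuming the response has already been parsed into a dict and uses the aggregation names from the query above (top_docs is a hypothetical helper name, and the response below is trimmed to the relevant keys):

```python
def top_docs(response, agg_name="field1_unique", hits_name="first_one"):
    """Collect the _source of every top_hits document, one bucket at a time."""
    docs = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        for hit in bucket[hits_name]["hits"]["hits"]:
            docs.append(hit["_source"])
    return docs

# Trimmed version of the response shown above.
response = {
    "aggregations": {
        "field1_unique": {
            "buckets": [
                {
                    "key": "dinner carrot potato broccoli",
                    "doc_count": 2,
                    "first_one": {"hits": {"total": 2, "hits": [
                        {"_source": {"field1": "dinner carrot potato broccoli",
                                     "field2": "something here",
                                     "lastseen": 1000}}
                    ]}},
                },
                {
                    "key": "fish chicken something",
                    "doc_count": 2,
                    "first_one": {"hits": {"total": 2, "hits": [
                        {"_source": {"field1": "fish chicken something",
                                     "field2": "dinner",
                                     "lastseen": 2000}}
                    ]}},
                },
            ]
        }
    }
}

flattened = top_docs(response)
print([doc["lastseen"] for doc in flattened])
```

The result is a flat list shaped like the "hits" the question asked for, one document per distinct field1 value.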
