Worp
Worp

Reputation: 1018

Elasticsearch Query - Return all documents that do not have a corresponding document

I have an index that contains documents who have a status. These are initially imported with a job and their status is set to 0.

For simplicity:

{
    "_uid" : 1234
    "id" : 1
    "name" : "someName",
    "status" : 0
}

Then another import job runs and extends these objects by iterating over each object with status=0. Each object that is extended gets the status 1.

{
    "_uid" : 1234
    "id" : 1
    "name" : "someName",
    "newProperty" : "someValue",
    "status" : 1
}

(Note the unchanged _uid. It's the same object)

Now I have a third import job that takes all objects with status one, takes their ID (the ID!!! Not their _uid!) and creates a new object with the same ID, but different UID:

{
    "_uid" : 5678
    "id" : 1
    "completelyDifferentProperty" : "someValue"
    "status" : 2
}

So now, for each ID, I have two objects: One with status = 1, One with status = 2.

For the last job I need to make sure that it only picks objects with status =1 that DO NOT YET have a corresponding status=2 object.

So I need a query to the effect of "Get all objects where status == 1 for which no status == 2 object with the same ID exists".

I have a feeling aggregations might help me but I haven't gotten it figured out yet.

Upvotes: 2

Views: 345

Answers (1)

Sloan Ahrens
Sloan Ahrens

Reputation: 8718

You can do it fairly easily with a parent/child relationship. This is sort of a special-case use of the capability, but I think it could be used to solve your problem.

To test it out, I set up an index like this, with parent_doc type and a child_doc type (I only included the properties necessary to set up the capability; it doesn't hurt to add more in your documents):

PUT /test_index
{
   "mappings": {
      "parent_doc": {
         "_id": {
            "path": "id"
         },
         "properties": {
            "id": {
               "type": "long"
            },
            "_uid": {
               "type": "long"
            },
            "status": {
               "type": "integer"
            }
         }
      },
      "child_doc": {
         "_parent": {
            "type": "parent_doc"
         },
         "_id": {
            "path": "id"
         },
         "properties": {
            "id": {
               "type": "long"
            },
            "_uid": {
               "type": "long"
            },
            "status": {
               "type": "long"
            }
         }
      }
   }
}

Then I added four docs; three parents, one child. There is one document that has "status: 1 that doesn't have a corresponding child document.

POST /test_index/_bulk
{"index":{"_type":"parent_doc"}}
{"_uid":1234,"id":1,"name":"someName","newProperty":"someValue","status":0}
{"index":{"_type":"parent_doc"}}
{"_uid":1234,"id":2,"name":"someName","newProperty":"someValue","status":1}
{"index":{"_type":"child_doc","_parent":2}}
{"_uid":5678,"id":2,"completelyDifferentProperty":"someValue","status":2}
{"index":{"_type":"parent_doc"}}
{"_uid":4321,"id":3,"name":"anotherName","newProperty":"anotherValue","status":1}

We can find the document we want like this; notice we are querying only the parent_doc type, and that our conditions are that status is 1 and no child (at all) exists:

POST /test_index/parent_doc/_search
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "bool": {
               "must": [
                  {
                     "term": {
                        "status": 1
                     }
                  },
                  {
                     "not": {
                        "filter": {
                           "has_child": {
                              "type": "child_doc",
                              "query": {
                                 "match_all": {}
                              }
                           }
                        }
                     }
                  }
               ]
            }
         }
      }
   }
}

This returns:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 2,
      "successful": 2,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "3",
            "_score": 1,
            "_source": {
               "_uid": 4321,
               "id": 3,
               "name": "anotherName",
               "newProperty": "anotherValue",
               "status": 1
            }
         }
      ]
   }
}

Here's all the code I used to test it:

http://sense.qbox.io/gist/d1a0267087d6e744b991de5cdec1c31d947ebc13

Upvotes: 1

Related Questions