Querying denormalized tree data in Elasticsearch

Question

I have tree data stored in Elasticsearch 7.9 with the data structure outlined below. I am trying to write a query which can give the top 10 children who have the most children underneath them.

Setup data

Given this example tree:

is described by the following data in ES:

{ "id": "A", "name": "User A" }
{ "id": "B", "name": "User B", "parents": ["A"], "parent1": "A" }
{ "id": "C", "name": "User C", "parents": ["A"], "parent1": "A" }
{ "id": "D", "name": "User D", "parents": ["A", "B"], "parent1": "B", "parent2": "A" }
{ "id": "E", "name": "User E", "parents": ["A", "B", "D"], "parent1": "D", "parent2": "B", "parent2": "A" }

every field is mapping type keyword

The document fields are:

"id" - Documents id, same as _id,
"parents" - All parents of the document or empty if its a root node
"parent1" - The parent of the document
"parent2" - The grand parent of the document
"parentN" - The Nth great grand parent up to 5

Desired Results

I would like to find all "parents" from User A and the total count of children. So in this example case the results would be

User B - 2
User C - 0

Test it yourself

PUT test_index
PUT test_index/_mapping
{
  "properties": {
    "id": { "type": "keyword" },
    "name": { "type": "keyword" },
    "referred_by_sub": { "type": "keyword" },
    "parents": { "type": "keyword" },
    "parent1": { "type": "keyword" },
    "parent2": { "type": "keyword" },
    "parent3": { "type": "keyword" },
    "parent4": { "type": "keyword" },
    "parent5": { "type": "keyword" }
  }
}

POST _bulk
{ "index" : { "_index" : "test_index", "_id" : "A" } }
{ "id": "A", "name": "User A" }
{ "index" : { "_index" : "test_index", "_id" : "B" } }
{ "id": "B", "name": "User B", "parents": ["A"], "parent1": "A" }
{ "index" : { "_index" : "test_index", "_id" : "C" } }
{ "id": "C", "name": "User C", "parents": ["A"], "parent1": "A" }
{ "index" : { "_index" : "test_index", "_id" : "D" } }
{ "id": "D", "name": "User D", "parents": ["A", "B"], "parent1": "B", "parent2": "A" }
{ "index" : { "_index" : "test_index", "_id" : "E" } }
{ "id": "E", "name": "User E", "parents": ["A", "B", "D"], "parent1": "D", "parent2": "B", "parent2": "A" }

Final result expanded from Joe's answer

For anyone coming here in the future I like to post my final result if it differs from the accepted answer. Mine includes the resulting document source as well as an array. These were not in the requirements because I was attempting to make my question as simple as possible.

Maybe it will help someone in the future.

Query

GET test_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "id": "A"
          }
        },
        {
          "term": {
            "parents": "A"
          }
        }
      ]
    }
  },
  "aggs": {
    "children_counter": {
      "scripted_metric": {
        "init_script": "state.ids_vs_children = [:]; state.root_children = [:]",
        "map_script": """
          def current_id = doc['id'].value;
          if (!state.ids_vs_children.containsKey(current_id)) {
            state.ids_vs_children[current_id] = new ArrayList();
          }
          
          if(doc['parent1'].contains(params.id)) {
            state.root_children[current_id] = params._source;
          }
          
          def parents = doc['parents'];
          if (parents.size() > 0) {
            for (def p : parents) {
              if (!state.ids_vs_children[current_id].contains(p)) {
                if (!state.ids_vs_children.containsKey(p)) {
                  state.ids_vs_children[p] = new ArrayList();
                }
                state.ids_vs_children[p].add(current_id);
              }
            }
          }
        """,
        "combine_script": """
          def results = [];
          for (def pair : state.ids_vs_children.entrySet()) {
            def uid = pair.getKey();
            if (!state.root_children.containsKey(uid)) {
              continue;
            }
            
            def doc_map = [:];
            doc_map["doc"] = state.root_children[uid];
            doc_map["num_children"] = pair.getValue().size();
            results.add(doc_map);
          }
      
          def final_result = [:];
          final_result['count'] = results.length;
          final_result['results'] = results;
          return final_result;
        """,
        "reduce_script": "return states",
        "params": {
          "id": "A"
        }
        
      }
    }
  }
}

Output

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "children_counter" : {
      "value" : [
        {
          "count" : 2,
          "results" : [
            {
              "num_children" : 1,
              "doc" : {
                "parent1" : "A",
                "name" : "User B",
                "id" : "B",
                "parents" : [
                  "A"
                ]
              }
            },
            {
              "num_children" : 0,
              "doc" : {
                "parent1" : "A",
                "name" : "User C",
                "id" : "C",
                "parents" : [
                  "A"
                ]
              }
            }
          ]
        }
      ]
    }
  }
}

Joe - Check out my books · Accepted Answer

Your denormalized tree already contains everything you need for that calculation but we'll need to access other docs' parents as we traverse the children and keep track of the references so that's a perfect use case for a scripted metric aggregation.

GET test_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "id": "A"
          }
        },
        {
          "term": {
            "parents": "A"
          }
        }
      ]
    }
  },
  "aggs": {
    "children_counter": {
      "scripted_metric": {
        "init_script": "state.ids_vs_children = [:];",
        "map_script": """
          def current_id = doc['id'].value;
          if (!state.ids_vs_children.containsKey(current_id)) {
            state.ids_vs_children[current_id] = new ArrayList();
          }
          
          def parents = doc['parents'];
          if (parents.size() > 0) {
            for (def p : parents) {
              if (!state.ids_vs_children[current_id].contains(p)) {
                state.ids_vs_children[p].add(current_id);
              }
            }
          }
        """,
        "combine_script": """
          def final_map = [:];
          for (def pair : state.ids_vs_children.entrySet()) {
            def uid = pair.getKey();
            if (params.exclude_users != null && params.exclude_users.contains(uid)) {
              continue;
            }
            
            final_map[uid] = pair.getValue().size();
          }
      
          return final_map;
        """,
        "reduce_script": "return states",
        "params": {
          "exclude_users": ["A"]
        }
      }
    }
  }
}

yielding

...
"aggregations" : {
  "children_counter" : {
    "value" : [
      {
        "B" : 2,    <--
        "C" : 0,    <--
        "D" : 1,
        "E" : 0
      }
    ]
  }
}

~~A top-level query is strongly recommended so you don't blow your CPU up b/c scripts like this are notoriously resource-intensive.~~ A top-level query is required to limit this to only A's children.

Tip: if you don't too frequently update these users, I'd advise to perform this children calculation before indexing -- you'll have to iterate somewhere so why not outside of ES?

Querying denormalized tree data in Elasticsearch

Setup data

Desired Results

Test it yourself

Final result expanded from Joe's answer

Answers (1)

Related Questions