ug_
ug_

Reputation: 11440

Querying denormalized tree data in Elasticsearch

I have tree data stored in Elasticsearch 7.9 with the data structure outlined below. I am trying to write a query which can give the top 10 children who have the most children underneath them.


Setup data

Given this example tree:

example tree

is described by the following data in ES:

{ "id": "A", "name": "User A" }
{ "id": "B", "name": "User B", "parents": ["A"], "parent1": "A" }
{ "id": "C", "name": "User C", "parents": ["A"], "parent1": "A" }
{ "id": "D", "name": "User D", "parents": ["A", "B"], "parent1": "B", "parent2": "A" }
{ "id": "E", "name": "User E", "parents": ["A", "B", "D"], "parent1": "D", "parent2": "B", "parent2": "A" }

every field is mapping type keyword

The document fields are:


Desired Results

I would like to find all "parents" from User A and the total count of children. So in this example case the results would be

User B - 2
User C - 0

Test it yourself

PUT test_index
PUT test_index/_mapping
{
  "properties": {
    "id": { "type": "keyword" },
    "name": { "type": "keyword" },
    "referred_by_sub": { "type": "keyword" },
    "parents": { "type": "keyword" },
    "parent1": { "type": "keyword" },
    "parent2": { "type": "keyword" },
    "parent3": { "type": "keyword" },
    "parent4": { "type": "keyword" },
    "parent5": { "type": "keyword" }
  }
}

POST _bulk
{ "index" : { "_index" : "test_index", "_id" : "A" } }
{ "id": "A", "name": "User A" }
{ "index" : { "_index" : "test_index", "_id" : "B" } }
{ "id": "B", "name": "User B", "parents": ["A"], "parent1": "A" }
{ "index" : { "_index" : "test_index", "_id" : "C" } }
{ "id": "C", "name": "User C", "parents": ["A"], "parent1": "A" }
{ "index" : { "_index" : "test_index", "_id" : "D" } }
{ "id": "D", "name": "User D", "parents": ["A", "B"], "parent1": "B", "parent2": "A" }
{ "index" : { "_index" : "test_index", "_id" : "E" } }
{ "id": "E", "name": "User E", "parents": ["A", "B", "D"], "parent1": "D", "parent2": "B", "parent2": "A" }

Final result expanded from Joe's answer

For anyone coming here in the future I like to post my final result if it differs from the accepted answer. Mine includes the resulting document source as well as an array. These were not in the requirements because I was attempting to make my question as simple as possible.

Maybe it will help someone in the future.

Query

GET test_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "id": "A"
          }
        },
        {
          "term": {
            "parents": "A"
          }
        }
      ]
    }
  },
  "aggs": {
    "children_counter": {
      "scripted_metric": {
        "init_script": "state.ids_vs_children = [:]; state.root_children = [:]",
        "map_script": """
          def current_id = doc['id'].value;
          if (!state.ids_vs_children.containsKey(current_id)) {
            state.ids_vs_children[current_id] = new ArrayList();
          }
          
          if(doc['parent1'].contains(params.id)) {
            state.root_children[current_id] = params._source;
          }
          
          def parents = doc['parents'];
          if (parents.size() > 0) {
            for (def p : parents) {
              if (!state.ids_vs_children[current_id].contains(p)) {
                if (!state.ids_vs_children.containsKey(p)) {
                  state.ids_vs_children[p] = new ArrayList();
                }
                state.ids_vs_children[p].add(current_id);
              }
            }
          }
        """,
        "combine_script": """
          def results = [];
          for (def pair : state.ids_vs_children.entrySet()) {
            def uid = pair.getKey();
            if (!state.root_children.containsKey(uid)) {
              continue;
            }
            
            def doc_map = [:];
            doc_map["doc"] = state.root_children[uid];
            doc_map["num_children"] = pair.getValue().size();
            results.add(doc_map);
          }
      
          def final_result = [:];
          final_result['count'] = results.length;
          final_result['results'] = results;
          return final_result;
        """,
        "reduce_script": "return states",
        "params": {
          "id": "A"
        }
        
      }
    }
  }
}

Output

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "children_counter" : {
      "value" : [
        {
          "count" : 2,
          "results" : [
            {
              "num_children" : 1,
              "doc" : {
                "parent1" : "A",
                "name" : "User B",
                "id" : "B",
                "parents" : [
                  "A"
                ]
              }
            },
            {
              "num_children" : 0,
              "doc" : {
                "parent1" : "A",
                "name" : "User C",
                "id" : "C",
                "parents" : [
                  "A"
                ]
              }
            }
          ]
        }
      ]
    }
  }
}

Upvotes: 2

Views: 889

Answers (1)

Joe - Check out my books
Joe - Check out my books

Reputation: 16943

Your denormalized tree already contains everything you need for that calculation but we'll need to access other docs' parents as we traverse the children and keep track of the references so that's a perfect use case for a scripted metric aggregation.

GET test_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "id": "A"
          }
        },
        {
          "term": {
            "parents": "A"
          }
        }
      ]
    }
  },
  "aggs": {
    "children_counter": {
      "scripted_metric": {
        "init_script": "state.ids_vs_children = [:];",
        "map_script": """
          def current_id = doc['id'].value;
          if (!state.ids_vs_children.containsKey(current_id)) {
            state.ids_vs_children[current_id] = new ArrayList();
          }
          
          def parents = doc['parents'];
          if (parents.size() > 0) {
            for (def p : parents) {
              if (!state.ids_vs_children[current_id].contains(p)) {
                state.ids_vs_children[p].add(current_id);
              }
            }
          }
        """,
        "combine_script": """
          def final_map = [:];
          for (def pair : state.ids_vs_children.entrySet()) {
            def uid = pair.getKey();
            if (params.exclude_users != null && params.exclude_users.contains(uid)) {
              continue;
            }
            
            final_map[uid] = pair.getValue().size();
          }
      
          return final_map;
        """,
        "reduce_script": "return states",
        "params": {
          "exclude_users": ["A"]
        }
      }
    }
  }
}

yielding

...
"aggregations" : {
  "children_counter" : {
    "value" : [
      {
        "B" : 2,    <--
        "C" : 0,    <--
        "D" : 1,
        "E" : 0
      }
    ]
  }
}

A top-level query is strongly recommended so you don't blow your CPU up b/c scripts like this are notoriously resource-intensive. A top-level query is required to limit this to only A's children.

Tip: if you don't too frequently update these users, I'd advise to perform this children calculation before indexing -- you'll have to iterate somewhere so why not outside of ES?

Upvotes: 3

Related Questions