CoolMcGrrr

Reputation: 774

Elasticsearch NEST TopHits aggregation

I've been struggling with a problem for a while now, so I thought I would run it by Stack Overflow.

My document type has a title, a language field (used for filtering), and a grouping ID field. (I'm leaving out all the other fields to keep this to the point.)

When I search, I want to find all documents whose title contains the search text, but I only want one document for each unique grouping ID.

I've been looking at the top_hits aggregation, and from what I can see it should be able to solve my problem.

When running this query against my index:

{
  "query": {
    "match": {
      "title": "dingo"
    }
  },
  "aggs": {
    "top-tags": {
      "terms": {
        "field": "groupId",
        "size": 1000000
      },
      "aggs": {
        "top_tag_hits": {
          "top_hits": {
            "_source": {
              "include": [
                "*"
              ]
            },
            "size": 1
          }
        }
      }
    }
  }
}

I get the following response (all results are in the same language):

{
    "took": 9,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 3,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "top-tags": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [{
                "key": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
                "doc_count": 2,
                "top_tag_hits": {
                    "hits": {
                        "total": 2,
                        "max_score": 1.4983996,
                        "hits": [{
                            "_index": "elasticsearch",
                            "_type": "productdocument",
                            "_id": "FB15279FB18E4B34AD66ACAF69B96E9E",
                            "_score": 1.4983996,
                            "_source": {
                                "groupId": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
                                "title": "wombat, dingo and zetapunga actionfigures"
                            }
                        }]
                    }
                }
            },
            {
                "key": "F11799ABD0C14B98ADF2554C84FF0DA0",
                "doc_count": 1,
                "top_tag_hits": {
                    "hits": {
                        "total": 1,
                        "max_score": 1.30684,
                        "hits": [{
                            "_index": "elasticsearch",
                            "_type": "productdocument",
                            "_id": "42562A25E4434A0091DE0C79A3E7F3F4",
                            "_score": 1.30684,
                            "_source": {
                                "groupId": "F11799ABD0C14B98ADF2554C84FF0DA0",
                                "title": "awesome dingo raptor"
                            }
                        }]
                    }
                }
            }]
        }
    }
}

This is exactly what I expected (two hits in one bucket, but only one document retrieved for that bucket). However, when I try this in NEST, I can't seem to retrieve all of the documents.

My query looks like this:

result = _elasticClient.Search<T>(s => s
                .From(skip)
                .Filter(fd => fd.Term(f => f.Language, language))
                .Size(pageSize)
                .SearchType(SearchType.Count)
                .Query(
                    q => q.Wildcard(f => f.Title, query, 2.0)
                         || q.Wildcard(f => f.Description, query)
                )
                .Aggregations(agd =>
                    agd.Terms("groupId", tagd => tagd
                        .Field("groupId")
                        .Size(100000) //We sadly need all products
                    )
                    .TopHits("top_tag_hits", thagd => thagd
                        .Size(1)
                        .Source(ssd => ssd.Include("*")))
                ));

var topHits = result.Aggs.TopHits("top_tag_hits");
var documents = topHits.Documents<ProductDocument>(); //contains only one document (I would expect it to contain two, one for each bucket)

Inspecting the aggregations in the debugger reveals a "groupId" aggregation with 2 buckets, matching what I see in my "raw" query against the index, just without any apparent way to retrieve the documents.

So my question is: how do I retrieve the top hit for each bucket? Or am I doing this completely wrong? Is there some other way to achieve what I am trying to do?

EDIT

After the help I received, I was able to retrieve my results with the following:

result = _elasticClient.Search<T>(s => s
                .From(skip)
                .Filter(fd => fd.Term(f => f.Language, language))
                .Size(pageSize)
                .SearchType(SearchType.Count)
                .Query(
                    q => q.Wildcard(f => f.Title, query, 2.0)
                         || q.Wildcard(f => f.Description, query)
                )
                .Aggregations(agd =>
                    agd.Terms("groupId", tagd => tagd
                        .Field("groupId")
                        .Size(0)
                        .Aggregations(tagdaggs =>
                            tagdaggs.TopHits("top_tag_hits", thagd => thagd
                                .Size(1)))
                    )
                ));

var groupIdAggregation = result.Aggs.Terms("groupId");

var topHits = groupIdAggregation.Items
    .Select(key => key.TopHits("top_tag_hits"))
    .SelectMany(topHitMetric => topHitMetric.Documents<ProductDocument>())
    .ToList();

Upvotes: 3

Views: 3774

Answers (1)

Evaldas Buinauskas

Reputation: 14077

Your NEST query runs the Terms aggregation and the TopHits aggregation side by side, as siblings, while your original query runs Terms first and then applies TopHits to each bucket.

You simply have to move the TopHits aggregation inside the Terms aggregation in your NEST query to make it work.

This should fix it:

.Aggregations(agd =>
    agd.Terms("groupId", tagd => tagd
        .Field("groupId")
        .Size(0)
        .Aggregations(tagdaggs =>
            tagdaggs.TopHits("top_tag_hits", thagd => thagd
                .Size(1)))
    )
)

By the way, you don't have to use Include("*") to include all fields; just remove that option. Also, specifying .Size(0) should bring back all possible terms.
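To make the nesting concrete, here is a rough sketch that reads the top hit of each bucket straight from the raw JSON, without NEST. It assumes a trimmed copy of the response from the question and uses System.Text.Json (.NET Core 3.0 or later); the class and method names are made up for illustration. The point is only that "top_tag_hits" lives inside each bucket of "top-tags", so you have to go through the buckets to reach the documents.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class TopHitsPerBucket
{
    // Trimmed copy of the response in the question: two buckets, one top hit each.
    private const string Response = @"{
      ""aggregations"": {
        ""top-tags"": {
          ""buckets"": [
            { ""key"": ""3044BC9E7C29450AAB2E4B6C9B35AAE2"", ""doc_count"": 2,
              ""top_tag_hits"": { ""hits"": { ""hits"": [
                { ""_source"": { ""title"": ""wombat, dingo and zetapunga actionfigures"" } } ] } } },
            { ""key"": ""F11799ABD0C14B98ADF2554C84FF0DA0"", ""doc_count"": 1,
              ""top_tag_hits"": { ""hits"": { ""hits"": [
                { ""_source"": { ""title"": ""awesome dingo raptor"" } } ] } } }
          ]
        }
      }
    }";

    // Hypothetical helper: walks buckets -> top_tag_hits -> hits -> hits
    // and collects (bucket key, title) for every hit found.
    public static List<(string GroupId, string Title)> ReadTopHits(string json)
    {
        var results = new List<(string GroupId, string Title)>();
        using var doc = JsonDocument.Parse(json);

        var buckets = doc.RootElement
            .GetProperty("aggregations")
            .GetProperty("top-tags")
            .GetProperty("buckets");

        foreach (var bucket in buckets.EnumerateArray())
        {
            var groupId = bucket.GetProperty("key").GetString();

            // The top_hits result is a sub-aggregation of the bucket,
            // not a sibling of the terms aggregation.
            var hits = bucket
                .GetProperty("top_tag_hits")
                .GetProperty("hits")
                .GetProperty("hits");

            foreach (var hit in hits.EnumerateArray())
            {
                var title = hit.GetProperty("_source").GetProperty("title").GetString();
                results.Add((groupId, title));
            }
        }

        return results;
    }

    public static void Main()
    {
        foreach (var (groupId, title) in ReadTopHits(Response))
            Console.WriteLine($"{groupId}: {title}");
    }
}
```

With the sample above this yields one entry per bucket, two in total. The nested NEST call has the same shape: take the Terms aggregation, go over its items, and read the TopHits from each item.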

Upvotes: 4
