Foo L
Foo L

Reputation: 11137

Elastic Find text in record group within time range

Elastic newbie here. I'm storing articles of blogs in ElasticSearch in this format:

{
  blog_id: keyword,
  blog_article_id: keyword,
  timestamp: date,
  article_text: text
}

Suppose I want to find all blogs with 2 or more different articles that mention X within the last 30 days. Is there a query to find all blog_ids that have different articles with the same word within a date range?

For example:

{
  "blog_id": "1"
  "blog_article_id": 10,
  "timestamp": 2020-01-02T00:00:00,
  "article_text": "... cups ..."
},
{
  "blog_id": "1"
  "blog_article_id": 11,
  "timestamp": 2020-01-20T00:00:00,
  "article_text": "... cups ..."
},
{
  "blog_id": "2"
  "blog_article_id": 10,
  "timestamp": 2020-01-20T00:00:00,
  "article_text": "... cups ..."
}

Searching for cups in date range [2020-01-01, 2020-01-30], blog_id 1 should be returned but not blog_id 2.

Is this the right way to model the problem or should I use a nested objects for an easier query?

Can this be made into a report in Kibana?

Upvotes: 0

Views: 160

Answers (1)

Kamal Kunjapur
Kamal Kunjapur

Reputation: 8840

This can be done by using the below set of queries, collated into a single ES request.

The way to do about this is, first filter the documents based on the timestamp i.e. via Range Query and post that you can apply Term Queries which would be like select distinct kinda of equivalent, and you can then add a Top Hits Aggregation query pipelined to Terms Query.

POST <your_index_name>/_size
{
  "size": 0, 
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "timestamp": {
              "gte": "2020-01-01",
              "lte": "2020-01-30"
            }
          }
        },
        {
          "match": {
            "article_text": "cups"
          }
        }
      ]
    }
  },
  "aggs": {
    "my_blog_ids": {
      "terms": {
        "field": "blog_id",
        "size": 100,                          <---- Term Size
        "min_doc_count": 2
      },
      "aggs": {
        "my_document_hits": {
          "top_hits": {
            "size": 10
          }
        },
        "bucket_count": {                     <---- bucket count
          "value_count": {
            "field": "_id"
          }
        }
      }
    }
  }
}

In the above there are two mentions.

First one is so that you only get the results of aggregation query and second one so that you only return those blog_ids whose count is > 1.

Below is the sample response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_blog_ids" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "1",                        <---- blog_id 1
          "doc_count" : 2,
          "my_document_hits" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : 1.0,
              "hits" : [
                {
                  "_index" : "my_blog_index",
                  "_type" : "_doc",
                  "_id" : "1",                       <---- doc 1
                  "_score" : 1.0,
                  "_source" : {
                    "blog_id" : "1",
                    "blog_article_id" : 10,
                    "timestamp" : "2020-01-02T00:00:00",
                    "article_text" : "... cups ..."
                  }
                },
                {
                  "_index" : "my_blog_index",
                  "_type" : "_doc",
                  "_id" : "2",                       <---- doc 2
                  "_score" : 1.0,
                  "_source" : {
                    "blog_id" : "1",
                    "blog_article_id" : 11,
                    "timestamp" : "2020-01-20T00:00:00",
                    "article_text" : "... cups ..."
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

Upvotes: 1

Related Questions