Reputation: 11137
Elastic newbie here. I'm storing articles of blogs in ElasticSearch in this format:
{
blog_id: keyword,
blog_article_id: keyword,
timestamp: date,
article_text: text
}
Suppose I want to find all blogs with 2 or more different articles that mention X
within the last 30 days. Is there a query to find all blog_id
s that have different articles with the same word within a date range?
For example:
{
"blog_id": "1"
"blog_article_id": 10,
"timestamp": 2020-01-02T00:00:00,
"article_text": "... cups ..."
},
{
"blog_id": "1"
"blog_article_id": 11,
"timestamp": 2020-01-20T00:00:00,
"article_text": "... cups ..."
},
{
"blog_id": "2"
"blog_article_id": 10,
"timestamp": 2020-01-20T00:00:00,
"article_text": "... cups ..."
}
Searching for cups
in date range [2020-01-01
, 2020-01-30
], blog_id
1 should be returned but not blog_id
2.
Is this the right way to model the problem or should I use a nested
objects for an easier query?
Can this be made into a report in Kibana?
Upvotes: 0
Views: 160
Reputation: 8840
This can be done by using the below set of queries, collated into a single ES request.
The way to do about this is, first filter the documents based on the timestamp
i.e. via Range Query and post that you can apply Term Queries
which would be like select distinct kinda of equivalent, and you can then add a Top Hits Aggregation
query pipelined to Terms Query.
POST <your_index_name>/_size
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": "2020-01-01",
"lte": "2020-01-30"
}
}
},
{
"match": {
"article_text": "cups"
}
}
]
}
},
"aggs": {
"my_blog_ids": {
"terms": {
"field": "blog_id",
"size": 100, <---- Term Size
"min_doc_count": 2
},
"aggs": {
"my_document_hits": {
"top_hits": {
"size": 10
}
},
"bucket_count": { <---- bucket count
"value_count": {
"field": "_id"
}
}
}
}
}
}
In the above there are two mentions.
First one is so that you only get the results of aggregation query and second one so that you only return those blog_ids whose count is > 1.
Below is the sample response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_blog_ids" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1", <---- blog_id 1
"doc_count" : 2,
"my_document_hits" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_blog_index",
"_type" : "_doc",
"_id" : "1", <---- doc 1
"_score" : 1.0,
"_source" : {
"blog_id" : "1",
"blog_article_id" : 10,
"timestamp" : "2020-01-02T00:00:00",
"article_text" : "... cups ..."
}
},
{
"_index" : "my_blog_index",
"_type" : "_doc",
"_id" : "2", <---- doc 2
"_score" : 1.0,
"_source" : {
"blog_id" : "1",
"blog_article_id" : 11,
"timestamp" : "2020-01-20T00:00:00",
"article_text" : "... cups ..."
}
}
]
}
}
}
]
}
}
}
Upvotes: 1