Jørgen Tvedt

Reputation: 1284

Elasticsearch / Kibana - buckets for `terms` that do not exist in the selected time frame

I have a very common problem: I need to show which users or document categories, all given by a keyword field, are not present in a given time interval. I default to using the Terms aggregation, which obviously does not return anything for the missing entries.

This is a very simple problem in a relational database: just do an outer join from the user table. In Kibana/Elasticsearch I cannot figure out how to solve it.

One way that works is to switch to Filter and then copy and paste all users into individual filter specifications. That, however, can't be maintained and does not scale to multiple reports.

I am fine with having to keep a single example document for each term, even if it's just a dummy. This would also make every item show up in Kibana auto-complete, etc. If I could then get the results to always include at least one bucket for each of these terms, the problem would be solved.

Example: the Kibana Y axis is a simple count, while the X axis should show the users with the fewest entries. The report is set to show data for Period 2:

User   |       Period 1        |      Period 2     |
MR_X   | o    o o o        o o |   o      o  o   o |
MISS_Y |     o         o   o   |       o           |
MR_Z   |  o      o      o      |                   |
MISS_W |                       |                   |

In this example, the report for Period 2 should at least show MISS_Y and MR_Z, as these are known in the dataset and have the fewest entries in Period 2. Some way to include MISS_W, which does not have any entries in the dataset, would be a bonus.

Upvotes: 1

Views: 897

Answers (1)

avik

Reputation: 2708

Apologies in advance if I've misunderstood your question. Aggregations provide a way to get different distributions of the documents in your result set. If you want different aggregations for different time intervals, you'll need your query to return results for all your time intervals, and you'll need to filter on different intervals within each of your aggregations.

For example, if you have the following:

  • A field called timestamp that you are using to specify your time interval
  • A field called user that you want to aggregate over
  • The time frame for your report (aka period 2 from your question) is the last 1 hour
  • Period 1 is everything before the last 1 hour

Then you could try structuring your Elasticsearch query as follows:

GET myindex/_search
{
  ...
  "aggs": {
    "period-2-distribution": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-1h"
          }
        }
      }, 
      "aggs": {
        "user-agg": {
          "terms": {
            "field": "user",
            "size": 1000
          }
        }
      }
    },
    "period-1-distribution": {
      "filter": {
        "range": {
          "timestamp": {
            "lt": "now-1h"
          }
        }
      }, 
      "aggs": {
        "user-agg": {
          "terms": {
            "field": "user",
            "size": 1000
          }
        }
      }
    }    
  }
}

To reiterate, if you currently have a query block before your aggs block, you'll need to remove any clause within query that restricts the time interval; otherwise the period 1 aggregation has no documents to work with. This is admittedly a very invasive change to your query, and I appreciate it might break another one of your requirements. In that case you will need to take a different approach, but Elasticsearch is fairly flexible and should hopefully provide you with a way to get what you want.
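Once both distributions come back, the "outer join" from the question can be done on the client by comparing the bucket keys. The following is only a minimal Python sketch of that step, not something Kibana does for you: the helper names `bucket_counts` and `period_2_report` are made up for illustration, `sample` is a hand-written stand-in for a real response (with counts read off the diagram in the question), and `extra_users` is an assumed way to pass in users such as MISS_W that have no documents at all.

```python
def bucket_counts(response, agg_name):
    """Map each user key to its doc_count for one of the filtered aggregations."""
    buckets = response["aggregations"][agg_name]["user-agg"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


def period_2_report(response, extra_users=()):
    """Return (user, period-2 count) pairs, fewest entries first.

    Users seen only in period 1, or supplied via extra_users, get a count of 0.
    """
    period_1 = bucket_counts(response, "period-1-distribution")
    period_2 = bucket_counts(response, "period-2-distribution")
    known_users = set(period_1) | set(period_2) | set(extra_users)
    rows = [(user, period_2.get(user, 0)) for user in known_users]
    # Sort by count, then by name so that ties come out in a stable order.
    return sorted(rows, key=lambda row: (row[1], row[0]))


# Hand-written sample shaped like the "aggregations" part of a search response.
sample = {
    "aggregations": {
        "period-1-distribution": {
            "doc_count": 12,
            "user-agg": {"buckets": [
                {"key": "MR_X", "doc_count": 6},
                {"key": "MISS_Y", "doc_count": 3},
                {"key": "MR_Z", "doc_count": 3},
            ]},
        },
        "period-2-distribution": {
            "doc_count": 5,
            "user-agg": {"buckets": [
                {"key": "MR_X", "doc_count": 4},
                {"key": "MISS_Y", "doc_count": 1},
            ]},
        },
    }
}

report = period_2_report(sample, extra_users=["MISS_W"])
for user, count in report:
    print(user, count)
```

With the sample above, MR_Z and MISS_W appear with a count of 0 ahead of MISS_Y and MR_X. Note that this only works if `size` in the terms aggregations is large enough to return every user.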

Upvotes: 1
