pred1212
pred1212

Reputation: 3

How to get unique field values based on recent and highest priority documents in Elasticsearch?

I have an Elasticsearch index that stores documents with the following fields:

I need to find 5 unique user_id values from the most recent and highest-priority documents. If I were working with SQL, I would solve this by:

SELECT DISTINCT user_id
FROM (
    SELECT user_id
    FROM documents
    ORDER BY timestamp DESC, priority DESC
) AS sorted_docs
LIMIT 5;
  1. Sorting the data with both timestamp and priority fields in descending order.
  2. Selecting distinct user_id, picking the first 5 from the sorted result.

However, I am unsure how to perform a similar operation in Elasticsearch. I have looked into using terms and composite aggregations, but I'm not clear on:

  1. Ensuring the data is sorted by timestamp and priority before selecting unique user_id.
  2. Limiting the result to the first 5 unique user_id values based on this sorting.

Can someone guide me on how to structure this query or provide an example? Any help or references would be greatly appreciated!

Upvotes: 0

Views: 17

Answers (1)

Sagar Patel
Sagar Patel

Reputation: 5486

You have not shared any sample data but based on the information you have given, I have created sample data and provided sample query.

You can use Collapse functionality from Elasticsearch to get distinct value for user_id.

Let consider below is your index mapping:

PUT test
{
  "mappings": {
    "properties": {
      "user_id":{
        "type": "keyword"
      },
      "timestamp":{
        "type": "date"
      },
     "priority":{
        "type": "integer"
      }
    }
  }
}

Below is sample data:

{ "index" : { "_index" : "test", "_id" : "1" } }
{ "user_id" : "101", "timestamp" : "2024-11-05T12:10:30Z","priority" : "1"}
{ "index" : { "_index" : "test", "_id" : "2" } }
{ "user_id" : "101", "timestamp" : "2024-11-05T16:10:30Z","priority" : "2"}
{ "index" : { "_index" : "test", "_id" : "3" } }
{ "user_id" : "102", "timestamp" : "2024-11-04T12:10:30Z","priority" : "3"}
{ "index" : { "_index" : "test", "_id" : "4" } }
{ "user_id" : "102", "timestamp" : "2024-11-05T08:10:30Z","priority" : "4"}
{ "index" : { "_index" : "test", "_id" : "5" } }
{ "user_id" : "103", "timestamp" : "2024-11-05T12:10:30Z","priority" : "5"}
{ "index" : { "_index" : "test", "_id" : "6" } }
{ "user_id" : "103", "timestamp" : "2024-11-03T12:10:30Z","priority" : "6"}
{ "index" : { "_index" : "test", "_id" : "7" } }
{ "user_id" : "104", "timestamp" : "2024-11-05T12:10:30Z","priority" : "7"}
{ "index" : { "_index" : "test", "_id" : "8" } }
{ "user_id" : "105", "timestamp" : "2024-11-04T12:10:30Z","priority" : "8"}
{ "index" : { "_index" : "test", "_id" : "9" } }
{ "user_id" : "105", "timestamp" : "2024-11-01T12:10:30Z","priority" : "9"}
{ "index" : { "_index" : "test", "_id" : "10" } }
{ "user_id" : "106", "timestamp" : "2024-11-02T12:10:30Z","priority" : "10"}

You can use below query which will sort on timestamp and priority as well as collapse on user_id which will make sure you will get all the unique user ids:

POST test/_search
{
  "size": 5, 
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    },
    {
      "priority": {
        "order": "desc"
      }
    }
  ], 
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "user_id"
  }
}

Below will be response for above query:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "test",
        "_id": "2",
        "_score": null,
        "_source": {
          "user_id": "101",
          "timestamp": "2024-11-05T16:10:30Z",
          "priority": "2"
        },
        "fields": {
          "user_id": [
            "101"
          ]
        },
        "sort": [
          1730823030000,
          2
        ]
      },
      {
        "_index": "test",
        "_id": "7",
        "_score": null,
        "_source": {
          "user_id": "104",
          "timestamp": "2024-11-05T12:10:30Z",
          "priority": "7"
        },
        "fields": {
          "user_id": [
            "104"
          ]
        },
        "sort": [
          1730808630000,
          7
        ]
      },
      {
        "_index": "test",
        "_id": "5",
        "_score": null,
        "_source": {
          "user_id": "103",
          "timestamp": "2024-11-05T12:10:30Z",
          "priority": "5"
        },
        "fields": {
          "user_id": [
            "103"
          ]
        },
        "sort": [
          1730808630000,
          5
        ]
      },
      {
        "_index": "test",
        "_id": "4",
        "_score": null,
        "_source": {
          "user_id": "102",
          "timestamp": "2024-11-05T08:10:30Z",
          "priority": "4"
        },
        "fields": {
          "user_id": [
            "102"
          ]
        },
        "sort": [
          1730794230000,
          4
        ]
      },
      {
        "_index": "test",
        "_id": "8",
        "_score": null,
        "_source": {
          "user_id": "105",
          "timestamp": "2024-11-04T12:10:30Z",
          "priority": "8"
        },
        "fields": {
          "user_id": [
            "105"
          ]
        },
        "sort": [
          1730722230000,
          8
        ]
      }
    ]
  }
}

I have consider 10 as highest priority as you didn't clarified in your question. If you want 1 as highest priority then change priority order to asc from desc in sort.

Upvotes: 0

Related Questions