Reputation: 14734
I am trying to return the results of a filter query in randomized order, so that all my documents get a fair chance of appearing on the first page of results. To avoid confusing users during repeated searches (and to easily support pagination), the results should stay consistent for the current day.
To do this I have developed the following script sort query. It combines the document id (a GUID, so already fairly random) with a daily salt (just the day of year and the current year combined) and hashes the result to produce what I would expect to be a fairly random string that only changes when the daily salt changes each day. (Ignore the extraneous elements in this specific query; it's generated from code.)
{
  "from": 0,
  "size": 20,
  "sort": {
    "_script": {
      "order": "asc",
      "type": "string",
      "script": "org.elasticsearch.common.Digest.md5Hex(dailySalt + doc['id'].value)",
      "params": {
        "dailySalt": "184-2013"
      }
    }
  },
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "tag_id": "Some Tag"
              }
            },
            {
              "match_all": {}
            }
          ]
        }
      }
    }
  },
  "fields": [
    "id"
  ]
}
Inspired by this similar question and answer.
It works, but not very well. I get slightly different results as I increment the daily salt, but the same documents keep appearing around the top results. They move slightly, but there's definitely a consistent pattern.
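One way to check whether the hash itself causes the pattern is to reproduce the ordering outside Elasticsearch. The sketch below is only an approximation: the helper names `sort_key` and `first_page` are hypothetical, the 196 ids are randomly generated stand-ins for the real GUIDs, and it assumes `Digest.md5Hex` returns the lowercase hex MD5 of the UTF-8 string.

```python
import hashlib
import random
import uuid


def sort_key(daily_salt: str, doc_id: str) -> str:
    """Mirror the script: hex MD5 of the salt concatenated with the id."""
    return hashlib.md5((daily_salt + doc_id).encode("utf-8")).hexdigest()


def first_page(doc_ids, daily_salt, size=20):
    """Sort ALL ids by their salted hash, then take the first page."""
    return sorted(doc_ids, key=lambda d: sort_key(daily_salt, d))[:size]


# Hypothetical corpus: 196 random GUIDs standing in for the real ids.
rng = random.Random(0)
doc_ids = [str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(196)]

page_a = first_page(doc_ids, "184-2013")
page_b = first_page(doc_ids, "185-2013")
overlap = set(page_a) & set(page_b)
print(f"{len(overlap)} of {len(page_a)} ids appear on both days' first pages")
```

If the overlap here is small while the live query keeps returning the same documents, the hash is probably not what keeps them at the top.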
I've tried to change the hash function to another I found:
org.elasticsearch.cluster.routing.operation.hash.djb.DjbHashFunction.DJB_HASH
but it gives very similar results, with the same documents near the top.
I'm no cryptography expert, so I presume this is normal behavior for common hash functions, and that there must be special hash functions that give more randomized results for similar inputs?
Is anyone familiar with one available in ElasticSearch? I'm using Searchbox.io (a cloud-hosted ElasticSearch service), so installing my own custom function is not an option.
Or am I approaching this problem from a completely wrong angle?
Edit: I just looked at the sort keys produced by the script, and it appears that the script is only being applied to the first page of results, which is then sorted on its own (rather than the script being applied to the full result set and therefore changing which documents appear on the first page).
Here are my first-page results (edited for brevity). You can see that on the first page alone the sort key ranges from 0c*** to fa*** for the first 20 docs, out of a total of ~200 docs.
Using 'dailySalt' = 185-2013
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 196,
"max_score": null,
"hits": [{
"fields": {
"id": "27662ef8-d2a7-4fde-80f6-1571b83c4cde"
},
"sort": ["0cbf8b4e7927f0a53a5b82f2630ff9ad"]
}, {
"fields": {
"id": "d9b11797-053f-495e-a676-0ec959dba879"
},
"sort": ["0fa8730a5239f8a3d1286cbe16619bfa"]
}, {
"fields": {
"id": "482c893f-1083-4860-892e-1b25cf442199"
},
"sort": ["295edd71cc48ac41c5e2f91315abf5ce"]
}, {
"fields": {
"id": "581fd0f1-9ecb-4e5c-920b-06413bfbf4f7"
},
"sort": ["4b9f0d17bc2333d13a1963b4f6afb829"]
}, {
"fields": {
"id": "de3dddb8-e296-4446-ac4c-135cc925669d"
},
"sort": ["4c5d0bcb50f5b600e539ba46b33b1007"]
}, {
"fields": {
"id": "c83ad22e-80b4-40f1-8e56-2153a1a1f9e8"
},
"sort": ["55efe0a692ab3205405f1c74732b8205"]
}, {
"fields": {
"id": "7bd19829-4f37-4e02-9fd1-0239b8ae8db4"
},
"sort": ["5adcd22c7c507244d7ba382812accdf3"]
}, {
"fields": {
"id": "42fcec43-851f-4133-a8db-1d2bf0b86ec8"
},
"sort": ["6757f46bd554e3353a2ebf35c6b3d24c"]
}, {
"fields": {
"id": "e119132b-4e93-4047-8513-1ce2452f0cdd"
},
"sort": ["6dbcb59a2b5e91523896d57695251b29"]
}, {
"fields": {
"id": "7d0acf5d-7c14-45a2-97b7-17939ff512f4"
},
"sort": ["9d99752ec0802e55dcfb3c83bcd2e4bb"]
}, {
"fields": {
"id": "2cdc21e4-3312-460b-9a18-094e4f95a56c"
},
"sort": ["9dc43d1d39e64cfe04c6d7b8f565faaa"]
}, {
"fields": {
"id": "0f665cb3-5648-416c-b08f-146d2a019319"
},
"sort": ["b61bb718fe63a287b6fcdc8bcd638604"]
}, {
"fields": {
"id": "1e852d49-2b3b-4d7a-9f1b-1495b94e723e"
},
"sort": ["ba7ad8a3a6e195a6bc28e341f9d6965b"]
}, {
"fields": {
"id": "ca2a5922-bb42-4317-b61c-129925436a1f"
},
"sort": ["bca0411cf8d67b4dcd5b205a5010367f"]
}, {
"fields": {
"id": "b1dac760-7d73-4b60-bd6d-08ea9453e68c"
},
"sort": ["be3714cfb2517e98d525aaea6e40cfa5"]
}, {
"fields": {
"id": "c4b08def-59db-4ac0-b16f-0c3fae4c01f2"
},
"sort": ["c4220b31c305d536c7a7d1639da32c66"]
}, {
"fields": {
"id": "cc7ac1fd-3e88-4503-a837-2000ebb6e2d9"
},
"sort": ["ceb5710fe2418fe3b353bf7b1f737570"]
}, {
"fields": {
"id": "5a5f90c9-b44f-4ca2-9d16-117c8e9fd388"
},
"sort": ["dc5fea76598633cb08c1459983ebca62"]
}, {
"fields": {
"id": "6d811d5b-4138-4a41-a186-1b9aa2b65623"
},
"sort": ["ea3c55ac123ac9e819b145402407d1de"]
}, {
"fields": {
"id": "b489d2da-b4a1-44de-acde-219109edd42f"
},
"sort": ["fab53cc11983b45b081d4b01df555c59"]
}]
}
}
Upvotes: 3
Views: 2680
Reputation: 14734
So it looks like this is a problem with ElasticSearch's script sort. It was applying the script only to the first page of results, and then sorting just that one page. I can't find this behavior documented anywhere, so I'm not sure whether it's a bug or by design for performance reasons.
Anyway, using a custom_score query with a similar script as the score function gives the desired results. I used the DJB hash instead of the original MD5 because it is fast, is designed for strings, and returns an int, which custom score scripts require:
{
  "from": 0,
  "size": 20,
  "query": {
    "custom_score": {
      "script": "org.elasticsearch.cluster.routing.operation.hash.djb.DjbHashFunction.DJB_HASH(dailySalt + doc['id'].value)",
      "params": {
        "dailySalt": "185-2013"
      },
      "query": {
        "filtered": {
          "query": {
            "match_all": {}
          },
          "filter": {
            "bool": {
              "must": [
                {
                  "term": {
                    "tag_id": "Some Tag"
                  }
                },
                {
                  "match_all": {}
                }
              ]
            }
          }
        }
      }
    }
  },
  "fields": [
    "id"
  ]
}
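For anyone curious what the score looks like, here is a rough Python sketch of the textbook DJB2 string hash. I am assuming, not confirming, that `DjbHashFunction.DJB_HASH` follows this definition (seed 5381, multiply by 33, add each character, truncate to a signed 32-bit int); the function name `djb_hash` is my own.

```python
def djb_hash(value: str) -> int:
    """Textbook DJB2: start at 5381, then hash = hash * 33 + char code."""
    h = 5381
    for ch in value:
        # ord() matches Java's charAt() for ASCII input such as GUIDs.
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF  # keep only the low 32 bits
    # Reinterpret as a signed 32-bit value, like a Java int.
    return h - 0x100000000 if h >= 0x80000000 else h


doc_id = "27662ef8-d2a7-4fde-80f6-1571b83c4cde"  # first id from the results above
for salt in ("184-2013", "185-2013"):
    print(salt, djb_hash(salt + doc_id))
```

Under that assumption the value can be negative, and it changes whenever the salt changes, which is all the daily ordering needs.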
Upvotes: 2
Reputation: 5744
MD5 is a cryptographic hash function, and such functions exhibit a property called the avalanche effect: even a small change in the input changes the output beyond recognition (on average, about half of the output bits flip). You can experiment with this using an online MD5 calculator, for example.
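The same experiment can be run locally. This is a minimal sketch using Python's `hashlib`; the helper name `md5_bits` is hypothetical, and the id and salts are taken from the question.

```python
import hashlib


def md5_bits(text: str) -> int:
    """Return the MD5 digest of text as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")


doc_id = "27662ef8-d2a7-4fde-80f6-1571b83c4cde"  # an id from the question
today = md5_bits("184-2013" + doc_id)
tomorrow = md5_bits("185-2013" + doc_id)

# XOR leaves a 1 wherever the two digests disagree.
flipped = bin(today ^ tomorrow).count("1")
print(f"{flipped} of 128 bits differ")
```

A count somewhere near 64 is what the avalanche effect predicts for a one-character change in the salt.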
Chance collisions are not a concern either: although deliberately constructed MD5 collisions exist, the odds that two of your inputs hash to the same value by accident are negligible.
Together, these properties mean that the script you have written produces an effectively unique, pseudo-random value for every document every day. The hash function is not the issue here.
How many documents are there? How many results are shown per page? What kinds of patterns do you see? There may be a surprisingly large chance that a given document appears on the front page purely by chance.
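As a rough illustration, take the figures shown in the question (196 hits, 20 per page) and assume the hash order behaves like a uniform random shuffle that is independent from day to day. Both assumptions are mine, and the variable and function names are only for illustration.

```python
total_docs = 196  # "total" in the response shown in the question
page_size = 20    # "size" in the query

# Chance that one particular document lands on the first page on a given day.
p_front = page_size / total_docs

# Expected number of documents shared by two independent days' first pages.
expected_overlap = page_size * p_front


def p_seen_within(days: int) -> float:
    """Chance a given document makes the first page at least once in `days` days."""
    return 1 - (1 - p_front) ** days


print(f"per-day chance: {p_front:.3f}")
print(f"expected repeats between two days: {expected_overlap:.2f}")
print(f"chance of appearing within a week: {p_seen_within(7):.3f}")
```

So even with a perfectly random order, about two of the twenty front-page documents would be expected to repeat from one day to the next.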
Upvotes: 2