Reputation: 3593
I have a minhash field generated for some text (using the MinHash algorithm). My question is: is it possible to complement the prefix query with wildcards? The problem is that the hashed string values depend on the position of the shingles/tokens in the content (text), so the first few characters (the prefix) might not always match exactly for similar content. Would it be possible to put a wildcard in front of the prefix in a query, e.g. *3AF8659GJ?
EDIT: I guess I wasn't thinking hard enough about the problem. The hash differences can be anywhere in the hash string (depending on where in the content the text differs). So I guess the only "best" way would be edit distance with some threshold.
E.g. put all hashes into an array and sort them in lexical order (or how would you sort hex strings?), then compare each hash only with the next k documents until the edit-distance threshold is reached, and put the duplicates in a separate array.
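The sort-and-compare idea from the EDIT could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the window size k and the threshold are hypothetical, and the hashes are assumed to be fixed-length hex strings in a single case (for which lexical order equals numeric order).

```python
def edit_distance(a, b):
    """Levenshtein distance using the two-row dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def near_duplicates(hashes, k=3, threshold=2):
    """Sort lexically, then compare each hash with its next k neighbours only."""
    ordered = sorted(hashes)
    pairs = []
    for i, h in enumerate(ordered):
        for other in ordered[i + 1 : i + 1 + k]:
            if edit_distance(h, other) <= threshold:
                pairs.append((h, other))
    return pairs
```

One caveat with this approach: two hashes that differ near the start of the string can end up far apart after sorting, so a small window k may miss them.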
Upvotes: 0
Views: 421
Reputation: 217544
Searching by suffix with a leading wildcard is highly discouraged for performance reasons, as explained in the official documentation:
"In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?"
There's still a way to achieve what you want by using a cleverly crafted analyzer. The idea is to index only the end of the minhash. You can achieve it as described below.
First, create an index with the following analyzer:
PUT minhash-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "suffix": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "lowercase",
              "reverse",
              "substring",
              "reverse"
            ]
          }
        },
        "filter": {
          "substring": {
            "type": "edgeNGram",
            "min_gram": 1,
            "max_gram": 10
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "minhash": {
          "type": "text",
          "analyzer": "suffix",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
The idea of the suffix analyzer is that it indexes all suffixes of length 1 to 10 (you can decide to index longer suffixes) of each minhash that you throw into your index.
So for instance, for the minhash C50FD711C2C43287351892A4D82F44B055F048C46D2C54197AC1D1E921F11E6699C4057C4B93907518E6DCA51A672D3D3E419160DAE276CB7716D11B94D8C3BB2E4A591329B7AF973D17A7F9336342FFAAFD4D, it will index all the following suffixes:
d
4d
d4d
fd4d
afd4d
aafd4d
faafd4d
ffaafd4d
2ffaafd4d
42ffaafd4d
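To see why the lowercase, reverse, edge n-gram, reverse chain yields exactly these suffixes, here is a small Python simulation of the filter chain. It is only a sketch of the idea, not Elasticsearch code; the function name suffix_tokens is made up for illustration.

```python
def suffix_tokens(value, min_gram=1, max_gram=10):
    """Mimic the filter chain: lowercase -> reverse -> edge n-grams -> reverse."""
    lowered = value.lower()                     # "lowercase" filter
    flipped = lowered[::-1]                     # first "reverse" filter
    longest = min(max_gram, len(flipped))
    # Edge n-grams are prefixes of the reversed string,
    # i.e. suffixes of the original one.
    grams = [flipped[:n] for n in range(min_gram, longest + 1)]
    return [gram[::-1] for gram in grams]       # second "reverse" filter


# Only the tail of the minhash matters for the result.
for token in suffix_tokens("9336342FFAAFD4D"):
    print(token)
```

Reversing first turns the suffix problem into a prefix problem, which is what the edge n-gram filter handles; the second reverse restores the original character order.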
Then you can easily search and find the above minhash by any suffix of up to 10 characters (the max_gram setting), for example with the following query:
POST minhash-index/_search
{
  "query": {
    "match": {
      "minhash": "42FFAAFD4D"
    }
  }
}
Upvotes: 1