MarcL

Reputation: 3593

Elasticsearch minhash prefix query with wildcards?

I have a minhash field generated from some text (using the MinHash algorithm). Is it possible to combine the prefix query with wildcards? The problem is that the hashed string values depend on the position of the shingles/tokens in the text, so the first few characters (the prefix) will not always match exactly for similar content. Would it be possible to put a wildcard in front of the prefix in a query, e.g. *3AF8659GJ?

EDIT: I guess I wasn't thinking hard enough about the problem. The differences can be anywhere in the hash string (depending on where in the content the texts differ). So I guess the "best", or only, way would be edit distance with some threshold.

E.g. put all hashes into an array and sort them in lexical order (or how would you sort hex strings?), then compare each hash only against the next k documents until the edit-distance threshold is reached, and put the duplicates in a separate array.
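A minimal Python sketch of that idea, under stated assumptions: the function names, the window size `k` and the `threshold` value are hypothetical and not taken from any library. It sorts the hashes lexically (for equal-length, same-case hex strings this matches numeric order) and compares each hash with its next k neighbours using Levenshtein distance. One caveat: sorting only brings together hashes that share a prefix, so pairs that differ early in the string may never land in the same window.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def near_duplicates(hashes, k=3, threshold=4):
    """Return pairs of hashes within `threshold` edits of each other.

    Each hash is compared only with the next `k` hashes in sorted order
    (hypothetical defaults; tune both values for real data).
    """
    ordered = sorted(h.upper() for h in hashes)
    pairs = []
    for i, current in enumerate(ordered):
        for other in ordered[i + 1 : i + 1 + k]:
            if edit_distance(current, other) <= threshold:
                pairs.append((current, other))
    return pairs
```

For example, `near_duplicates(["ABCDEF12", "ABCDEF13", "00000000"], k=2, threshold=1)` pairs the first two hashes and leaves the third one out.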

Upvotes: 0

Views: 421

Answers (1)

Val

Reputation: 217544

Searching by suffix is highly discouraged for performance reasons, as explained in the official documentation:

In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?

There's still a way to achieve what you want by using a cleverly crafted analyzer. The idea is to index only the end of the minhash. You can achieve it as described below.

First, create an index with the following analyzer:

PUT minhash-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "suffix": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "lowercase",
              "reverse",
              "substring",
              "reverse"
            ]
          }
        },
        "filter": {
          "substring": {
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 10
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "minhash": {
          "type": "text",
          "analyzer": "suffix",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

The idea of the suffix analyzer is that it will index all suffixes of length 1 to 10 (you can decide to index longer suffixes) for each minhash that you throw into your index.

So for instance, for the minhash C50FD711C2C43287351892A4D82F44B055F048C46D2C54197AC1D1E921F11E6699C4057C4B93907518E6DCA51A672D3D3E419160DAE276CB7716D11B94D8C3BB2E4A591329B7AF973D17A7F9336342FFAAFD4D, it will index all the following suffixes:

  • d
  • 4d
  • d4d
  • fd4d
  • afd4d
  • aafd4d
  • faafd4d
  • ffaafd4d
  • 2ffaafd4d
  • 42ffaafd4d
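To see why the filter chain yields exactly these tokens, here is a rough Python imitation of it (lowercase, reverse, edge n-grams, reverse again). This is only an illustration of the logic, not Elasticsearch's actual implementation, and `suffix_tokens` is a hypothetical helper name:

```python
def suffix_tokens(value, min_gram=1, max_gram=10):
    """Mimic the lowercase -> reverse -> edge n-gram -> reverse chain."""
    lowered = value.lower()            # "lowercase" filter
    reversed_value = lowered[::-1]     # first "reverse" filter
    longest = min(max_gram, len(reversed_value))
    # Edge n-grams are prefixes of the reversed string ...
    grams = [reversed_value[:n] for n in range(min_gram, longest + 1)]
    # ... and reversing each one again turns them into suffixes.
    return [gram[::-1] for gram in grams]


for token in suffix_tokens("336342FFAAFD4D"):
    print(token)
```

Running it on the tail of the minhash above prints the ten suffixes from `d` up to `42ffaafd4d`. A search term longer than `max_gram` characters has no corresponding token in the index, so it would not match.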

Then you can easily search and find the above minhash with the following query:

POST minhash-index/_search
{
  "query": {
    "match": {
      "minhash": "42FFAAFD4D"
    }
  }
}

Upvotes: 1
