Rife

Reputation: 555

Search by exact term position with Elasticsearch

I use the whitespace analyzer to index a field named hash, so the field text '1 2 3 4 5' is indexed as five terms [1, 2, 3, 4, 5].

My question is how to match on exact term position. For example, if the required accuracy is at least 4/5, '2 1 3 4 5' should not match, but '8 2 3 4 5' should match. How can I do that?

Splitting the value into five fields would work, but I want to use just one field.

Upvotes: 2

Views: 697

Answers (2)

Rife

Reputation: 555

Use the whitespace analyzer and make the position part of each term: change '1 2 3 4 5' to '0_1 1_2 2_3 3_4 4_5' before indexing, where 0_1 means the position is 0 and the value is 1. Only one field is indexed, but the search still needs a query with multiple term clauses.

Query body for '8 2 3 4 5':

{
  "query": {
    "bool": {
      "should": [
        { "term": { "hash": "0_8" } },
        { "term": { "hash": "1_2" } },
        { "term": { "hash": "2_3" } },
        { "term": { "hash": "3_4" } },
        { "term": { "hash": "4_5" } }
      ],
      "minimum_should_match": 4
    }
  }
}
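The encoding step can be done on the client, both before indexing and before searching. Here is a minimal Python sketch of that idea. The helper names `encode_positions` and `build_query` are hypothetical, the field name `hash` is taken from the question, and the code only builds the request body without calling Elasticsearch.

```python
import json


def encode_positions(text):
    """Prefix each whitespace-separated token with its zero-based position."""
    return " ".join(f"{i}_{tok}" for i, tok in enumerate(text.split()))


def build_query(text, field="hash", min_match=4):
    """Build a bool query body that needs at least `min_match` positional terms."""
    terms = encode_positions(text).split()
    return {
        "query": {
            "bool": {
                "should": [{"term": {field: t}} for t in terms],
                "minimum_should_match": min_match,
            }
        }
    }


# Value to store in the field instead of the raw '1 2 3 4 5'.
print(encode_positions("1 2 3 4 5"))  # 0_1 1_2 2_3 3_4 4_5

# Request body to send to the index's _search endpoint.
print(json.dumps(build_query("8 2 3 4 5"), indent=2))
```

Because the position is baked into each term, '2 1 3 4 5' shares only three terms (2_3, 3_4, 4_5) with the indexed value and falls below the threshold of 4.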

Upvotes: 0

Pierre Mallet

Reputation: 7221

You can use a combination of a shingle token filter and minimum_should_match at query time.

Explanation:

With a shingle token filter, "1 2 3 4 5" can be transformed into this token stream:

{
  "tokens": [
    {
      "token": "1 2",
      "start_offset": 0,
      "end_offset": 3,
      "type": "shingle",
      "position": 0
    },
    {
      "token": "2 3",
      "start_offset": 2,
      "end_offset": 5,
      "type": "shingle",
      "position": 1
    },
    {
      "token": "3 4",
      "start_offset": 4,
      "end_offset": 7,
      "type": "shingle",
      "position": 2
    },
    {
      "token": "4 5",
      "start_offset": 6,
      "end_offset": 9,
      "type": "shingle",
      "position": 3
    }
  ]
}

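The same applies to your query, so a shingle token only matches if the adjacent numbers appear in the same order. minimum_should_match controls the percentage of the query's tokens that must match in the document. Note that the percentage applies to the shingles, not to the original numbers: a five-number query produces four shingles, and 80% of 4 is 3.2, which Elasticsearch rounds down to 3.

Here is an example.

In the mapping, we configure the shingle filter and an analyzer that uses it: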

PUT so_54684997
{
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "myShingledAnalyzer"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "myShingle": {
          "type": "shingle",
          "output_unigrams": false
        }
      },
      "analyzer": {
        "myShingledAnalyzer": {
          "tokenizer": "whitespace",
          "filter": ["myShingle"]
        }
      }
    }
  }
}

Then we index a document:

PUT so_54684997/_doc/1
{
  "content": "1 2 3 4 5"
}

Query 1 => No match (all the numbers are present, but fewer than 4 of 5 are in the right order)

POST so_54684997/_search
{
  "query": {
    "match": {
      "content": {
        "query": "2 1 3 4 5",
        "minimum_should_match": "80%"
      }
    }
  }
}

Query 2 => Match (only 4 of the 5 numbers, but in the right order)

POST so_54684997/_search
{
  "query": {
    "match": {
      "content": {
        "query": "1 2 3 4",
        "minimum_should_match": "80%"
      }
    }
  }
}

Query 3 => Match (4 of the 5 numbers are in the same order)

POST so_54684997/_search
{
  "query": {
    "match": {
      "content": {
        "query": "8 2 3 4 5",
        "minimum_should_match": "80%"
      }
    }
  }
}
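The three results can be checked by hand with a small Python model of the analysis. This is only a simplified sketch, not Elasticsearch itself: it assumes two-word shingles (the shingle filter's default size), treats each query shingle as one optional clause, and rounds the percentage down. The function names `shingles`, `required_matches`, and `would_match` are hypothetical.

```python
def shingles(text, size=2):
    """Return the word shingles of `size` adjacent whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def required_matches(n_clauses, percent=80):
    """Number of clauses needed for a percentage threshold, rounded down."""
    return (n_clauses * percent) // 100


def would_match(doc, query, percent=80):
    """Approximate whether `query` reaches the threshold against `doc`."""
    doc_shingles = set(shingles(doc))
    query_shingles = shingles(query)
    hits = sum(1 for s in query_shingles if s in doc_shingles)
    return hits >= required_matches(len(query_shingles), percent)


doc = "1 2 3 4 5"
for query in ["2 1 3 4 5", "1 2 3 4", "8 2 3 4 5"]:
    print(query, "->", would_match(doc, query))
```

For "2 1 3 4 5" only "3 4" and "4 5" are shared, which is below the 3 required. "1 2 3 4" shares all 3 of its shingles (2 required), and "8 2 3 4 5" shares 3 of 4 (3 required).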

I don't know if this will handle all your cases, but I think it's a good starting point!

Upvotes: 2
