Mayur Buragohain

Reputation: 1615

Elasticsearch path_hierarchy tokenizes half of the path

I am trying to index a path using the path_hierarchy tokenizer, but it seems to tokenize only half of the path I provide. I have tried different paths and the results are the same.

My settings are -

{
    "settings" : { 
        "number_of_shards" : 5,
        "number_of_replicas" : 0,
        "analysis":{
            "analyzer":{
                "keylower":{
                    "type": "custom",
                    "tokenizer":"keyword",
                    "filter":"lowercase"
                },
                "path_analyzer": {
                    "type": "custom",
                    "tokenizer": "path_tokenizer",
                    "filter": [ "lowercase", "asciifolding", "path_ngrams" ]
                },
                "code_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "asciifolding", "code_stemmer" ]
                },
                "not_analyzed": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [ "lowercase", "asciifolding", "code_stemmer" ]
                }
            },
            "tokenizer": {
                "path_tokenizer": {
                  "type": "path_hierarchy"
                }
            },
            "filter": {
                "path_ngrams": {
                    "type": "edgeNGram",
                    "min_gram": 3,
                    "max_gram": 15
                },
                "code_stemmer": {
                    "type": "stemmer",
                    "name": "minimal_english"
                }
            }
        }
    }
}

My mapping is as follows -

{
  "dynamic": "strict",
  "properties": {
    "depot_path": {
      "type": "string",
      "analyzer": "path_analyzer"
    }
  },
  "_all": {
      "store": "yes",
      "analyzer": "english"
  }
}

I provided "//cm/mirror/v1.2/Kolkata/ixin-packages/builds/" as depot_path, and on analyzing it I found that the tokens were formed as follows -

               "key": "//c",
               "key": "//cm",
               "key": "//cm/",
               "key": "//cm/m",
               "key": "//cm/mi",
               "key": "//cm/mir",
               "key": "//cm/mirr",
               "key": "//cm/mirro",
               "key": "//cm/mirror",
               "key": "//cm/mirror/",
               "key": "//cm/mirror/v",
               "key": "//cm/mirror/v1",
               "key": "//cm/mirror/v1.",

Why is it that the entire path is not tokenized?

My expected output is to have tokens formed all the way up to //cm/mirror/v1.2/Kolkata/ixin-packages/builds/

I have tried increasing the tokenizer's buffer_size but no luck. Does anyone know what it is that I'm doing wrong?
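For reference, one way to inspect the tokens the analyzer produces directly is an _analyze request along these lines ("depot" here is just a placeholder for the actual index name):

GET depot/_analyze
{
    "analyzer": "path_analyzer",
    "text": "//cm/mirror/v1.2/Kolkata/ixin-packages/builds/"
}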

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html

Upvotes: 1

Views: 122

Answers (2)

bittusarkar

Reputation: 6357

This is because you've set "max_gram" to 15 in your path_ngrams filter. Notice that the largest token generated ("//cm/mirror/v1.") is exactly 15 characters long. Change it to a sufficiently large number and you'll get your desired tokens.
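For example, a path_ngrams definition along these lines should do it (255 here is just an arbitrarily generous limit, not a required value):

"filter": {
    "path_ngrams": {
        "type": "edgeNGram",
        "min_gram": 3,
        "max_gram": 255
    }
}

With that in place, the edge n-gram filter no longer truncates the tokens emitted by the path_hierarchy tokenizer, so the full path survives.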

Upvotes: 1

Shubhangi

Reputation: 2254

"max_gram": 15 is limiting token size to 15. If you increase "max_gram" , you would see further path will be tokenized.

Below are examples from my environment.

"max_gram" :15 
input path : /var/log/www/html/web/
path_analyser tokenized this path upto /var/log/www/ht i.e. 15 characters


 "max_gram" :100
    input path : /var/log/www/html/web/WANTED
    path_analyser tokenized this path upto /var/log/www/html/web/WANTED i.e. 28  characters <100
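One caveat worth adding: analysis settings can only be changed on a closed index, so updating the filter on an existing index takes a sequence roughly like the following ("my_index" is a placeholder name):

POST my_index/_close

PUT my_index/_settings
{
    "analysis": {
        "filter": {
            "path_ngrams": {
                "type": "edgeNGram",
                "min_gram": 3,
                "max_gram": 100
            }
        }
    }
}

POST my_index/_open

Also note that already-indexed documents keep their old tokens; paths indexed before the change need to be reindexed to pick up the longer tokens.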

Upvotes: 1
