Get exact match after doing mapping as not_analyzed

Question

I have an Elasticsearch type that I mapped as below:

"mappings": {
  "jardata": {
    "properties": {
      "groupID": {
        "index": "not_analyzed",
        "type": "string"
      },
      "artifactID": {
        "index": "not_analyzed",
        "type": "string"
      },
      "directory": {
        "type": "string"
      },
      "jarFileName": {
        "index": "not_analyzed",
        "type": "string"
      },
      "version": {
        "index": "not_analyzed",
        "type": "string"
      }
    }
  }
}

I left directory as analyzed because I want to be able to give only the last folder name and still get results. But when I want to search for a specific directory, I need to give the whole path, since the same folder name can exist in two different paths. Because the field is analyzed, such a search returns all matching documents instead of the specific one I want.

So I want the field to act as both analyzed and not_analyzed. Is there a way to do that?

Joanna Mamczynska · Accepted Answer

Let's say you have the following document indexed:

{
    "directory": "/home/docs/public"
}

The standard analyzer is not enough in your case, as it will create the following terms while indexing:

[home, docs, public]

Note that the [/home/docs/public] token is missing: characters like "/" act as separators here.
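
To see why the full path cannot match, here is a rough Python sketch of that behaviour. It is only an approximation (the real standard analyzer follows Unicode text segmentation rules), and the function name standard_like_terms is made up for illustration:

```python
import re

def standard_like_terms(text):
    """Rough stand-in for the standard analyzer (an approximation only):
    split on anything that is not an ASCII letter or digit, then lowercase."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

terms = standard_like_terms("/home/docs/public")
print(terms)                         # ['home', 'docs', 'public']
print("/home/docs/public" in terms)  # False: the full path is never a term
```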

One solution is to use the NGram tokenizer with the punctuation character class in its token_chars list. Elasticsearch then treats "/" the same way as a letter or digit, so it is kept inside the tokens. This allows you to search with tokens such as:

[/hom, /home, ..., /home/docs/publi, /home/docs/public, ..., /docs/public, etc...]
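
If you want to check which tokens come out, the idea can be simulated in a few lines of Python. This is a sketch of the technique and not Elasticsearch's actual implementation; the helper names are hypothetical, and the default min_gram and max_gram values are the ones used in the mapping in this answer:

```python
import unicodedata

def is_token_char(ch):
    # Mirrors token_chars = [letter, digit, punctuation]; "/" is Unicode punctuation.
    return ch.isalpha() or ch.isdigit() or unicodedata.category(ch).startswith("P")

def ngram_terms(text, min_gram=4, max_gram=18):
    # Split the text on characters outside the allowed classes (e.g. spaces).
    chunks, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))

    # Emit every substring of each chunk whose length is within [min_gram, max_gram].
    terms = set()
    for chunk in chunks:
        for start in range(len(chunk)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(chunk):
                    terms.add(chunk[start:start + size])
    return terms

terms = ngram_terms("/home/docs/public")
print("/home/docs/public" in terms)  # True: 17 characters, within max_gram
print("/docs/public" in terms)       # True: a partial path is also a term
print("/docs/private" in terms)      # False: not a substring of the path
```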

Index mapping:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 4,
                    "max_gram": 18,
                    "token_chars": [
                        "letter",
                        "digit",
                        "punctuation"
                    ]
                }
            }
        }
    },
    "mappings": {
        "jardata": {
            "properties": {
                "directory": {
                    "type": "string",
                    "analyzer": "ngram_analyzer"
                }
            }
        }
    }
}

Now both search queries:

{
    "query": {
      "bool" : {
        "must" : {
          "term" : {
             "directory": "/docs/public"
           }
        }
      }
    }
}

and

{
    "query": {
      "bool" : {
        "must" : {
          "term" : {
             "directory": "/home/docs/public"
           }
        }
      }
    }
}

will return the indexed document.

One thing you have to consider is the maximum token length, which is set by "max_gram". A path longer than that value cannot be matched as a whole by a term query, so for directory paths you may need to increase it.
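
A small Python sketch of that limit, under the simplifying assumption that the whole path is one chunk (only letters, digits and punctuation). The function name and the longer path are hypothetical examples, not taken from your data:

```python
def is_indexed_term(path, query, min_gram=4, max_gram=18):
    """True if `query` would be one of the n-grams emitted for `path`.
    Assumes `path` contains only letters, digits and punctuation,
    so the tokenizer sees it as a single chunk."""
    return min_gram <= len(query) <= max_gram and query in path

long_path = "/home/docs/public/releases"  # hypothetical path, 26 characters

print(is_indexed_term(long_path, long_path))               # False: longer than max_gram
print(is_indexed_term(long_path, "/public/releases"))      # True: 16 characters
print(is_indexed_term(long_path, long_path, max_gram=26))  # True once max_gram is raised
```

Keep in mind that a larger max_gram also means more terms per document, so the index grows.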

An alternative solution is to use the Whitespace tokenizer, which breaks the text into terms only on whitespace, together with an NGram filter, using the following mapping:

{
    "settings": {
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 4,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "ngram_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "jardata": {
            "properties": {
                "directory": {
                    "type": "string",
                    "analyzer": "my_analyzer"
                }
            }
        }
    }
}
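
As a rough illustration of what this second analyzer produces, here is a Python sketch of the three steps (whitespace split, lowercase, n-grams of 4 to 20 characters). It only simulates the idea, and the function name is hypothetical. Since term queries are not analyzed, the text you search for has to be lowercase to match the lowercased terms:

```python
def whitespace_ngram_terms(text, min_gram=4, max_gram=20):
    """Simulate: whitespace tokenizer -> lowercase filter -> ngram filter."""
    terms = set()
    for token in text.split():   # whitespace tokenizer: "/" stays inside the token
        token = token.lower()    # lowercase filter
        for start in range(len(token)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(token):
                    terms.add(token[start:start + size])
    return terms

terms = whitespace_ngram_terms("/home/Docs/public")
print("/home/docs/public" in terms)  # True: full path, lowercased
print("/docs/public" in terms)       # True: partial path
print("/home/Docs/public" in terms)  # False: terms are stored in lowercase
```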
