Roxas Shadow

Reputation: 380

Handling the dot in ElasticSearch

I have a string property called summary that has analyzer set to trigrams and search_analyzer set to words.

"filter": {
    "words_splitter": {
        "type": "word_delimiter",
        "preserve_original": "true"
    },
    "english_words_filter": {
        "type": "stop",
        "stop_words": "_english_"
    },
    "trigrams_filter": {
        "type": "ngram",
        "min_gram": "2",
        "max_gram": "20"
    }
},
"analyzer": {
    "words": {
        "filter": [
            "lowercase",
            "words_splitter",
            "english_words_filter"
        ],
        "type": "custom",
        "tokenizer": "whitespace"
    },
    "trigrams": {
        "filter": [
            "lowercase",
            "words_splitter",
            "trigrams_filter",
            "english_words_filter"
        ],
        "type": "custom",
        "tokenizer": "whitespace"
    }
}
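
For context, the summary field mapping that wires these analyzers together looks roughly like this (a sketch; the mapping type name `doc` is an assumption, the rest matches my setup):

```json
"mappings": {
  "doc": {
    "properties": {
      "summary": {
        "type": "string",
        "analyzer": "trigrams",
        "search_analyzer": "words"
      }
    }
  }
}
```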

I need query strings given in input, like React and HTML (or React, html), to be matched against documents whose summary contains the words React, reactjs, react.js, html, html5. The more keywords a document matches, the higher its score should be (ideally, documents with just a single partial match would score lower).

The thing is, I guess that at the moment react.js is being split into both react and js, since I also get all the documents that contain just js. On the other hand, Reactjs returns nothing. I also think I need words_splitter in order to ignore the comma.
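
For reference, the query I'm running is just a match on summary (a sketch of my setup, using the sample_index index mentioned below):

```json
POST /sample_index/_search
{
  "query": {
    "match": {
      "summary": "React, html"
    }
  }
}
```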

Upvotes: 3

Views: 3314

Answers (2)

Roxas Shadow

Reputation: 380

I found a solution.

Basically, I define the word_delimiter filter with catenate_all enabled:

"words_splitter": {
  "catenate_all": "true",
  "type": "word_delimiter",
  "preserve_original": "true"
}

and use it in the words analyzer together with a keyword tokenizer:

"words": {
  "filter": [
      "words_splitter"
  ],
  "type": "custom",
  "tokenizer": "keyword"
}

Calling http://localhost:9200/sample_index/_analyze?analyzer=words&pretty=true&text=react.js, I get the following tokens:

{
"tokens": [
    {
        "token": "react.js",
        "start_offset": 0,
        "end_offset": 8,
        "type": "word",
        "position": 0
    },
    {
        "token": "react",
        "start_offset": 0,
        "end_offset": 5,
        "type": "word",
        "position": 0
    },
    {
        "token": "reactjs",
        "start_offset": 0,
        "end_offset": 8,
        "type": "word",
        "position": 0
    },
    {
        "token": "js",
        "start_offset": 6,
        "end_offset": 8,
        "type": "word",
        "position": 1
    }
  ]
}

Upvotes: 1

paweloque

Reputation: 18874

You can solve the problem with names like react.js with a keyword marker filter and by defining the analyzer so that it uses the keyword filter. This will prevent react.js from being split into react and js tokens.

Here is an example configuration for the filter:

     "filter": {
        "keywords": {
           "type": "keyword_marker",
           "keywords": [
              "react.js",
           ]
        }
     }

And the analyzer:

     "analyzer": {
        "main_analyzer": {
           "type": "custom",
           "tokenizer": "standard",
           "filter": [
              "lowercase",
              "keywords",
              "synonym_filter",
              "german_stop",
              "german_stemmer"
           ]
        }
     }
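
The analyzer above references some filters from my setup (synonym_filter, german_stop, german_stemmer) whose definitions are not shown. As a sketch, using standard Elasticsearch filter types, they could be defined along these lines (the exact synonym list is just an example):

```json
"filter": {
  "synonym_filter": {
    "type": "synonym",
    "synonyms": ["react.js, reactjs => react"]
  },
  "german_stop": {
    "type": "stop",
    "stopwords": "_german_"
  },
  "german_stemmer": {
    "type": "stemmer",
    "language": "light_german"
  }
}
```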

You can see whether your analyzer behaves as required using the analyze command:

GET /<index_name>/_analyze?analyzer=main_analyzer&text="react.js is a nice library"

This should return the following tokens where react.js is not tokenized:

{
   "tokens": [
      {
         "token": "react.js",
         "start_offset": 1,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "is",
         "start_offset": 10,
         "end_offset": 12,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "a",
         "start_offset": 13,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "nice",
         "start_offset": 15,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "library",
         "start_offset": 20,
         "end_offset": 27,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}

For words that are similar but not exactly the same, such as React.js and Reactjs, you could use a synonym filter. Do you have a fixed set of keywords that you want to match?
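
As a sketch (the filter name and synonym list here are only examples), such a synonym filter could map the variants onto a canonical token:

```json
"filter": {
  "react_synonyms": {
    "type": "synonym",
    "synonyms": [
      "reactjs, react.js => react",
      "html5 => html"
    ]
  }
}
```

Adding this filter to the analyzer chain means a search for react would also match documents indexed with reactjs or react.js.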

Upvotes: 1
