Reputation: 380
I have a string property called summary
that has analyzer
set to trigrams
and search_analyzer
set to words
.
"filter": {
"words_splitter": {
"type": "word_delimiter",
"preserve_original": "true"
},
"english_words_filter": {
"type": "stop",
"stop_words": "_english_"
},
"trigrams_filter": {
"type": "ngram",
"min_gram": "2",
"max_gram": "20"
}
},
"analyzer": {
"words": {
"filter": [
"lowercase",
"words_splitter",
"english_words_filter"
],
"type": "custom",
"tokenizer": "whitespace"
},
"trigrams": {
"filter": [
"lowercase",
"words_splitter",
"trigrams_filter",
"english_words_filter"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
I need that query strings given in input like React and HTML
(or React, html
) are being matched to documents that contain in the summary
the words React
, reactjs
, react.js
, html
, html5
. As more matching keywords they have, an higher score they have (I would expect lower scores on documents that have just a word matching not even at 100%, ideally).
The thing is, I guess at the moment react.js
is split in both react
and js
since I get all the documents that contain js
as well. On the other hand, Reactjs
returns nothing. I also think to need words_splitter
in order to ignore the comma.
Upvotes: 3
Views: 3314
Reputation: 380
I found a solution.
Basically I'm going to define the word_delimiter
filter with catenate_all
active
"words_splitter": {
"catenate_all": "true",
"type": "word_delimiter",
"preserve_original": "true"
}
giving it to the words
analyzer with a keyword
tokenizer
"words": {
"filter": [
"words_splitter"
],
"type": "custom",
"tokenizer": "keyword"
}
Calling http://localhost:9200/sample_index/_analyze?analyzer=words&pretty=true&text=react.js
I get the following tokens:
{
"tokens": [
{
"token": "react.js",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
},
{
"token": "react",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "reactjs",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
},
{
"token": "js",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 1
}
]
}
Upvotes: 1
Reputation: 18874
You can solve the problem with names like react.js with a keyword marker filter and by defining the analyzer so that it uses the keyword filter. This will prevent react.js from being split into react and js tokens.
Here is an example configuration for the filter:
"filter": {
"keywords": {
"type": "keyword_marker",
"keywords": [
"react.js",
]
}
}
And the analyzer:
"analyzer": {
"main_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"keywords",
"synonym_filter",
"german_stop",
"german_stemmer"
]
}
}
You can see whether your analyzer behaves as required using the analyze command:
GET /<index_name>/_analyze?analyzer=main_analyzer&text="react.js is a nice library"
This should return the following tokens where react.js is not tokenized:
{
"tokens": [
{
"token": "react.js",
"start_offset": 1,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "is",
"start_offset": 10,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "a",
"start_offset": 13,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "nice",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "library",
"start_offset": 20,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 4
}
]
}
For the words that are similar but not exactly the same as: React.js and Reactjs you could use a synonym filter. Do you have a fixed set of keywords that you want to match?
Upvotes: 1