Reputation:
We have a search engine for text content which contains strings like c++ or c#. The switch to Elasticsearch has shown that the search does not match on terms like c++; the ++ is removed.
How can we teach Elasticsearch to match such terms correctly in a full-text search and not remove special characters? Characters like the comma , should of course still be removed.
Upvotes: 1
Views: 250
Reputation: 32386
You need to create your own custom analyzer which generates tokens according to your requirements. For your example, I created the below custom analyzer on a text field named language and indexed some sample docs:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "char_filter": [
            "replace_comma"
          ]
        }
      },
      "char_filter": {
        "replace_comma": {
          "type": "mapping",
          "mappings": [
            ", => \\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "language": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
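For reference, the sample docs can be indexed with the standard _doc endpoint; the id 1 and the field value below are just placeholders:
PUT http://{{hostname}}:{{port}}/{{index}}/_doc/1
{
  "language": "c++"
}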
Tokens generated for text like c++, c# and c,java:
POST http://{{hostname}}:{{port}}/{{index}}/_analyze
{
  "text": "c#",
  "analyzer": "my_analyzer"
}
{
  "tokens": [
    {
      "token": "c#",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}
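The c++ case from the question works the same way; below is the same _analyze call for c++ with the output you can expect, since the whitespace tokenizer leaves the ++ intact:
POST http://{{hostname}}:{{port}}/{{index}}/_analyze
{
  "text": "c++",
  "analyzer": "my_analyzer"
}
{
  "tokens": [
    {
      "token": "c++",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    }
  ]
}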
For c,java it generated 2 separate tokens, c and java, because the char filter replaces the , with whitespace, as shown below:
POST http://{{hostname}}:{{port}}/{{index}}/_analyze
{
  "text": "c, java",
  "analyzer": "my_analyzer"
}
{
  "tokens": [
    {
      "token": "c",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "java",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    }
  ]
}
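With this mapping in place, a standard match query against the language field finds c++ again, because a match query analyzes the query string with the field's analyzer, so c++ stays a single token on both the index and search side. A minimal sketch, assuming docs were indexed as above:
POST http://{{hostname}}:{{port}}/{{index}}/_search
{
  "query": {
    "match": {
      "language": "c++"
    }
  }
}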
Note: You need to understand the analysis process and modify your custom analyzer accordingly to make it work for all of your use cases. My example might not cover all your edge cases, but I hope it gives you an idea of how to handle such requirements.
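One such edge case is case sensitivity: the analyzer above defines no token filters, so C++ and c++ would produce different tokens. A possible extension, shown here only as a sketch, is to add the built-in lowercase token filter to the same analyzer definition:
"my_analyzer": {
  "tokenizer": "whitespace",
  "char_filter": [
    "replace_comma"
  ],
  "filter": [
    "lowercase"
  ]
}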
Upvotes: 1