user4144415

How to optimize elasticsearch's full text search to match strings like 'C++'

We run a search engine over text content that contains strings like 'c++' or 'c#'. After switching to Elasticsearch, searches no longer match terms like 'c++': the '++' is stripped during analysis.

How can we configure Elasticsearch to match such terms correctly in a full-text search without removing the special characters? Characters like the comma ',' should, of course, still be removed.

Upvotes: 1

Views: 250

Answers (1)

Amit

Reputation: 32386

You need to create your own custom analyzer that generates tokens according to your requirements. For your example, I created the custom analyzer below, defined it on a text field named language, and indexed some sample docs:

Index creation with a custom analyzer

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "char_filter": [
                        "replace_comma"
                    ]
                }
            },
            "char_filter": {
                "replace_comma": {
                    "type": "mapping",
                    "mappings": [
                        ", => \\u0020"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "language": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }
    }
}
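The behavior of this analyzer, a mapping char filter followed by the whitespace tokenizer, can be sketched locally in plain Python. This is only an illustration of the resulting token stream, not Elasticsearch's actual implementation:

```python
def analyze(text):
    """Mimic the custom analyzer above: a mapping char filter that
    turns ',' into a space, followed by a whitespace tokenizer."""
    # char_filter "replace_comma": ", => \u0020"
    filtered = text.replace(",", " ")
    # "whitespace" tokenizer: splits on whitespace only, so
    # punctuation such as '++' or '#' survives inside tokens
    return filtered.split()

print(analyze("c++"))      # ['c++']
print(analyze("c#"))       # ['c#']
print(analyze("c, java"))  # ['c', 'java']
```

The key point is that the whitespace tokenizer, unlike the standard tokenizer, does not strip punctuation, so 'c++' and 'c#' stay intact while the char filter still lets commas act as separators.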

Tokens generated for text like 'c++', 'c#', and 'c, java':

POST http://{{hostname}}:{{port}}/{{index}}/_analyze

{
  "text" : "c#",
  "analyzer": "my_analyzer"
}

{
    "tokens": [
        {
            "token": "c#",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        }
    ]
}
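For 'c++', the same _analyze call should likewise return a single intact token (assuming the analyzer above; the offsets follow the same pattern as the 'c#' response):

{
  "text" : "c++",
  "analyzer": "my_analyzer"
}

{
    "tokens": [
        {
            "token": "c++",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        }
    ]
}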

For 'c, java' it generates two separate tokens, 'c' and 'java', since the char filter replaces ',' with whitespace, as shown below:

{
  "text" : "c, java",
  "analyzer":"my_analyzer"
}

{
    "tokens": [
        {
            "token": "c",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "java",
            "start_offset": 3,
            "end_offset": 7,
            "type": "word",
            "position": 1
        }
    ]
}

Note: You need to understand the analysis process and modify your custom analyzer accordingly so it works for all of your use cases. My example might not cover every edge case, but I hope it gives you an idea of how to handle such requirements.

Upvotes: 1
