Youxu

Reputation: 1110

What is the best way to handle common terms that contain special characters, such as C# and C++?

I have some documents whose titles contain c# or c++, and the title field uses the standard analyzer. When I query c# on the title field, I get both the c# and the c++ documents, and the c++ documents even score higher than the c# documents. That makes sense, since the standard analyzer strips both '#' and '++' from the tokens, so both terms are indexed as just 'c'.

What is the best way to handle this kind of special term? In my case specifically, I want c# documents to score higher than c++ documents when searching for "C#".

Upvotes: 1

Views: 48

Answers (1)

Pavel Vasilev

Reputation: 1042

Here is an approach you can use:

  1. Introduce a copy field that keeps the values with their special characters. For that you'll need to:

    • Define a custom analyzer (the whitespace tokenizer is important here, because it preserves your special characters):

      PUT my_index     
      {
         "settings":{
            "analysis":{
               "analyzer":{
                  "my_analyzer":{ 
                     "type":"custom",
                     "tokenizer":"whitespace",
                     "filter":[
                        "lowercase"
                     ]
                  }
               }
            }
         }
      }
      
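    • Create the copy field (the _wcc suffix stands for 'with special characters'). Because the index already exists after the previous request, add the mapping through the mapping API rather than with a second PUT my_index, which would be rejected:

      PUT my_index/_mapping/my_type
      {
        "properties": {
          "prog_lang": {
            "type": "text",
            "copy_to": "prog_lang_wcc",
            "analyzer": "standard"
          },
          "prog_lang_wcc": {
            "type": "text",
            "analyzer": "my_analyzer"
          }
        }
      }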
      
  2. When querying, search both fields and boost the prog_lang_wcc field, like this (either a multi-match query or a plain boolean query with a boost would work; note that the phrase variant of multi_match is spelled "phrase", not "match_phrase"):

    GET /_search
    {
      "query": {
        "multi_match" : {
          "query" : "c#",
          "type": "phrase",
          "fields" : [ "prog_lang_wcc^3", "prog_lang" ]
        }
      }
    }
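
The "pure boolean + boost" variant mentioned in step 2 could look roughly like the sketch below. It assumes the my_index index and the field names from the mapping in step 1; the boost of 3 is simply carried over from the multi-match example and is not a tuned value. One difference to keep in mind: a bool query adds up the scores of its matching should clauses, whereas the phrase type of multi_match takes the best-scoring field, so the absolute scores will differ even if the ordering comes out the same.

    GET /my_index/_search
    {
      "query": {
        "bool": {
          "should": [
            { "match_phrase": { "prog_lang_wcc": { "query": "c#", "boost": 3 } } },
            { "match_phrase": { "prog_lang": { "query": "c#" } } }
          ]
        }
      }
    }

With only should clauses present, a document has to match at least one of them, so c++ documents can still be returned through the prog_lang clause, but only c# documents pick up the boosted prog_lang_wcc score.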
    

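To check that the two analyzers really behave differently before reindexing anything, you can run the same text through both with the analyze API. This is only a quick sanity check and assumes the my_index index with the my_analyzer definition from step 1; the sample text is made up.

    GET my_index/_analyze
    {
      "analyzer": "standard",
      "text": "C# C++"
    }

    GET my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "C# C++"
    }

The first request should return two identical tokens, c and c, which is why the standard-analyzed field cannot tell the two languages apart. The second should return c# and c++, which is what lets the boosted prog_lang_wcc field favor the exact match.
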
Upvotes: 0
