What is the best way to handle common terms that contain special characters, like C# and C++?

Question

I have some documents whose titles contain c# or c++, and the title field uses the standard analyzer. When I query for c# on the title field, I get all the c# and c++ documents, and the c++ documents even score higher than the c# documents. That makes sense, since the standard analyzer strips both '#' and '++' from the tokens.

What is the best way to handle this kind of special term? In my case specifically, I want c# documents to score higher than c++ documents when searching for "C#".

Pavel Vasilev · Accepted Answer

Here is an approach you can use:

Introduce a copy field that keeps the values with their special characters. For that you'll need to:

Define a custom analyzer (the whitespace tokenizer is important here: it preserves your special characters):

PUT my_index     
{
   "settings":{
      "analysis":{
         "analyzer":{
            "my_analyzer":{ 
               "type":"custom",
               "tokenizer":"whitespace",
               "filter":[
                  "lowercase"
               ]
            }
         }
      }
   }
}
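
To see why the tokenizer choice matters, here is a minimal Python sketch that imitates the two analyzers on a couple of made-up titles. It is only an approximation: the real standard analyzer uses Unicode text segmentation, while `standard_like_tokens` below simply keeps runs of ASCII letters and digits. The function names and sample titles are hypothetical and not part of Elasticsearch.

```python
import re


def standard_like_tokens(text):
    # Rough stand-in for the standard analyzer (an assumption, not Lucene's
    # real algorithm): keep runs of letters/digits and lowercase them, so
    # '#' and '+' are dropped.
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def whitespace_lowercase_tokens(text):
    # Imitates my_analyzer: split on whitespace only, then lowercase,
    # so '#' and '++' stay attached to the token.
    return [token.lower() for token in text.split()]


# Hypothetical sample titles, for illustration only.
for title in ("C# in Depth", "Effective C++"):
    print(title)
    print("  standard-like:       ", standard_like_tokens(title))
    print("  whitespace+lowercase:", whitespace_lowercase_tokens(title))
```

With the standard-like behaviour both titles produce the token c, which is why a search for c# cannot tell them apart. The whitespace variant keeps c# and c++ as distinct tokens. One caveat: a whitespace tokenizer also keeps adjacent punctuation, so a title containing "C#," would yield the token c#, with the comma attached.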

Create the copy field (the _wcc suffix stands for 'with special characters'):

PUT my_index/_mapping/my_type
{
  "properties": {
    "prog_lang": {
      "type": "text",
      "copy_to": "prog_lang_wcc",
      "analyzer": "standard"
    },
    "prog_lang_wcc": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

When querying, search both fields and boost the prog_lang_wcc field, like this (either a multi_match query or a plain bool query with a boost works):

GET /_search
{
  "query": {
    "multi_match" : {
      "query" : "c#",
      "type": "phrase",
      "fields" : [ "prog_lang_wcc^3", "prog_lang" ]
    }
  }
}
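
For the "pure boolean + boost" variant mentioned above, a request body could look roughly like the one built by this Python sketch. The helper name `boosted_bool_query` and its parameters are hypothetical; the field names and the boost of 3 are taken from the multi_match example. It is one possible shape for the query, not necessarily the one the answer had in mind.

```python
import json


def boosted_bool_query(text, exact_field="prog_lang_wcc",
                       fallback_field="prog_lang", boost=3):
    # Hypothetical helper: builds a bool query with two "should" clauses.
    # A hit on the special-character field is boosted, so documents that
    # contain the exact token (e.g. "c#") rank above those that only match
    # the standard-analyzed field.
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {exact_field: {"query": text, "boost": boost}}},
                    {"match": {fallback_field: {"query": text}}},
                ]
            }
        }
    }


# Print the JSON body that would be sent to the _search endpoint.
print(json.dumps(boosted_bool_query("c#"), indent=2))
```

Because the bool query has only should clauses, a document needs to match at least one of them, and a document matching both accumulates the score of both.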
