Elasticsearch strange filter behaviour

Question

I'm trying to replace a particular string inside a field, so I used a custom analyzer with a character filter, just as described in the docs, but it didn't work.
Here are my index settings:

{
    "settings": {
        "analysis": {
            "char_filter": {
                "doule_colon_to_space": {
                    "type": "mapping",
                    "mappings": [ "::=> "]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": [ "doule_colon_to_space" ],
                    "tokenizer": "standard"
                }
            }
        }
    }
}

which should replace all double colons (::) in a field with spaces.
I then update my mapping to use the analyzer:

{
    "posts": {
        "properties": {
          "id": {
            "type": "long"
          },
          "title": {
            "type": "string", 
            "analyzer": "my_analyzer",
            "fields": {
                "simple": {
                    "type": "string", 
                    "index": "not_analyzed"
                }
            }
          }
        }
      }
}

Then I put a document in the index:

{
    "id": 1, 
    "title": "Person::Bruce Wayne"
}

I then tested whether the analyzer works, and it appears it doesn't: when I send https://localhost:/first_test/_analyze?analyzer=my_analyzer&text=Person::Someone+Close, I get two tokens back, 'PersonSomeone' (joined together) and 'Close'. Am I doing this right? Maybe I should escape the space somehow? I'm using Elasticsearch 1.3.4.

Dusty · Accepted Answer

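I think the whitespace in your char_filter mapping is being ignored. Try using the Unicode escape sequence for a single space instead (the backslash is doubled so that the JSON parser passes the literal text \u0020 through to the filter rather than turning it back into a plain space):

"mappings": [ "::=>\\u0020"]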

Update:

In response to your comment, the short answer is yes, the example is wrong. The docs do suggest that you can use a mapping character filter to replace a token with another one which is padded by whitespace, but the code disagrees.

The source code for the MappingCharFilterFactory uses this regex to parse the settings:

// source => target
private static Pattern rulePattern = Pattern.compile("(.*)\\s*=>\\s*(.*)\\s*$");

This regex matches (and effectively discards) any whitespace (\s*) surrounding the second replacement token ((.*)), so it seems that you cannot use leading or trailing whitespace as part of your replacement mapping (though it could include interstitial whitespace). Even if the regex were different, the matched token is trim()ed, which would have removed any leading and trailing whitespace.
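
To see this concretely, here is a minimal Java sketch that applies the same regex and trim() to a few rules. The class name RuleParsingDemo and both helper methods are made up for illustration and are not Elasticsearch code; in particular, decodeUnicodeEscapes is only a simplified stand-in for whatever escape handling the real factory performs after trimming.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleParsingDemo {
    // Same pattern as quoted above: source => target
    private static final Pattern RULE_PATTERN =
            Pattern.compile("(.*)\\s*=>\\s*(.*)\\s*$");

    // Returns the replacement text that survives the regex match and trim().
    static String parseTarget(String rule) {
        Matcher m = RULE_PATTERN.matcher(rule);
        if (!m.find()) {
            throw new IllegalArgumentException("Invalid mapping rule: [" + rule + "]");
        }
        return m.group(2).trim();
    }

    // Hypothetical, simplified decoder (an assumption, not the real implementation):
    // turns a backslash, the letter u and four hex digits into one character.
    static String decodeUnicodeEscapes(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length()) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A plain trailing space, a padded target, and an escaped space.
        String[] rules = { "::=> ", "::=> a b ", "::=>\\u0020" };
        for (String rule : rules) {
            String target = decodeUnicodeEscapes(parseTarget(rule));
            // Brackets make leading and trailing whitespace visible.
            System.out.println("[" + rule + "] -> [" + target + "]");
        }
    }
}
```

With these inputs, the plain-space rule ends up with an empty replacement (which is why 'Person::Someone' became 'PersonSomeone'), the padded rule keeps only its interior space, and the escaped rule yields a single space.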
