Adam Benson

Reputation: 8562

Mapping search analyzer (with apostrophes) not working

This question is based on the "Tidying up Punctuation" section at https://www.elastic.co/guide/en/elasticsearch/guide/current/char-filters.html

Specifically, that this:

  "char_filter": { 
    "quotes": {
      "type": "mapping",
      "mappings": [ 
        "\\u0091=>\\u0027",
        "\\u0092=>\\u0027",
        "\\u2018=>\\u0027",
        "\\u2019=>\\u0027",
        "\\u201B=>\\u0027"
      ]
    }
  }

will turn "weird" apostrophes into a normal one.

But it doesn't seem to work.

I create this index:

{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "analysis": {
        "char_filter": {
          "char_filter_quotes": {
            "type": "mapping",
            "mappings": [
              "\\u0091=>\\u0027",
              "\\u0092=>\\u0027",
              "\\u2018=>\\u0027",
              "\\u2019=>\\u0027",
              "\\u201B=>\\u0027"
            ]
          }
        },
        "analyzer": {
          "analyzer_Text": {
            "type": "standard",
            "char_filter": [ "char_filter_quotes" ]
          }
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "Text": {
          "type": "text",
          "analyzer": "analyzer_Text",
          "search_analyzer": "analyzer_Text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}

Add this document:

{
  "Text": "Fred's Jim‘s Pete’s Mark‘s"
}

Run this search and get a hit (on "Fred's" with "Fred's" highlighted):

{
    "query":
    {
        "match":
        {
            "Text": "Fred's"
        }
    },
    "highlight":
    {
        "fragment_size": 200,
        "pre_tags": [ "<span class='search-hit'>" ],
        "post_tags": [ "</span>" ],
        "fields": { "Text": { "type": "fvh" } }
    }
}

If I change the above search like this:

    "Text": "Fred‘s"

I get no hits. Why not? I thought the search_analyzer would turn "Fred‘s" into "Fred's", which should then match. Also, if I search on

    "Text": "Mark's"

I get nothing, but

    "Text": "Mark‘s"

does hit. The whole point of the exercise was to keep apostrophes but allow for the fact that, occasionally, non-standard apostrophes slip through and still get a hit.

Even more confusingly, if I analyze this at http://127.0.0.1:9200/esidx_json_gs_entry/_analyze:

{
    "char_filter": [ "char_filter_quotes" ],
    "tokenizer" : "standard",
    "filter" : [ "lowercase" ],
    "text" : "Fred's Jim‘s Pete’s Mark‛s"
}

I get exactly what I would expect:

{
    "tokens": [
        {
            "token": "fred's",
            "start_offset": 0,
            "end_offset": 6,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "jim's",
            "start_offset": 7,
            "end_offset": 12,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "pete's",
            "start_offset": 13,
            "end_offset": 19,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "mark's",
            "start_offset": 20,
            "end_offset": 26,
            "type": "<ALPHANUM>",
            "position": 3
        }
    ]
}
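By contrast, running the same text through the analyzer that is actually mapped to the field shows what happens at index and search time. The _analyze endpoint accepts a "field" parameter for this; a minimal sketch against the same index:

{
    "field": "Text",
    "text": "Fred's Jim‘s Pete’s Mark‛s"
}

If the field's analyzer were applying the char filter, the tokens here would all come back with plain apostrophes.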

In the search, the search analyzer appears to do nothing. What am I missing?

TVMIA,

Adam (Editors - yes I know that saying "Thank you" is "fluff" but I wish to be polite so please leave it in.)

Upvotes: 0

Views: 80

Answers (1)

jay

Reputation: 2077

There is a small mistake in your analyzer definition. It should be

"tokenizer": "standard"

not

"type": "standard"

With "type": "standard" you are configuring the built-in standard analyzer, which does not take a char_filter, so your char_filter_quotes is never applied. Replacing it with a tokenizer turns the definition into a custom analyzer, and the char filter is then used at both index and search time.

Also, once you have indexed a document, you can check the actual terms stored for a field by using _termvectors. So in your example you can do a GET on

http://127.0.0.1:9200/esidx_json_gs_entry/_doc/1/_termvectors
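_termvectors also accepts a body if you want to trim the output down to just the terms; a sketch that returns only the Text field without the statistics:

{
  "fields": [ "Text" ],
  "term_statistics": false,
  "field_statistics": false
}

With the corrected analyzer, the terms listed for Text should all contain the plain apostrophe (fred's, jim's, pete's, mark's), which is a quick way to confirm the char filter ran at index time.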

Upvotes: 1
