Reputation: 8562
This question is based on the "Tidying up Punctuation" section at https://www.elastic.co/guide/en/elasticsearch/guide/current/char-filters.html
Specifically, it says that this:
"char_filter": {
"quotes": {
"type": "mapping",
"mappings": [
"\\u0091=>\\u0027",
"\\u0092=>\\u0027",
"\\u2018=>\\u0027",
"\\u2019=>\\u0027",
"\\u201B=>\\u0027"
]
}
}
will turn "weird" apostrophes into a normal one.
But it doesn't seem to work.
I create this index:
{
"settings": {
"index": {
"number_of_shards": 1,
"number_of_replicas": 1,
"analysis": {
"char_filter": {
"char_filter_quotes": {
"type": "mapping",
"mappings": [
"\\u0091=>\\u0027",
"\\u0092=>\\u0027",
"\\u2018=>\\u0027",
"\\u2019=>\\u0027",
"\\u201B=>\\u0027"
]
}
},
"analyzer": {
"analyzer_Text": {
"type": "standard",
"char_filter": [ "char_filter_quotes" ]
}
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"Text": {
"type": "text",
"analyzer": "analyzer_Text",
"search_analyzer": "analyzer_Text",
"term_vector": "with_positions_offsets"
}
}
}
}
}
Add this document:
{
"Text": "Fred's Jim‘s Pete’s Mark‘s"
}
Run this search and get a hit (on "Fred's" with "Fred's" highlighted):
{
"query":
{
"match":
{
"Text": "Fred's"
}
},
"highlight":
{
"fragment_size": 200,
"pre_tags": [ "<span class='search-hit'>" ],
"post_tags": [ "</span>" ],
"fields": { "Text": { "type": "fvh" } }
}
}
If I change the above search like this:
"Text": "Fred‘s"
I get no hits. Why not? I thought the search_analyzer would turn the "Fred‘s" into "Fred's" which should hit. Also, if I search on
"Text": "Mark's"
I get nothing, but
"Text": "Mark‘s"
does hit. The whole point of the exercise was to keep apostrophes but allow for the fact that, occasionally, non-standard apostrophes slip through and still get a hit.
Even more confusingly, if I analyze this at http://127.0.0.1:9200/esidx_json_gs_entry/_analyze:
{
"char_filter": [ "char_filter_quotes" ],
"tokenizer" : "standard",
"filter" : [ "lowercase" ],
"text" : "Fred's Jim‘s Pete’s Mark‛s"
}
I get exactly what I would expect:
{
"tokens": [
{
"token": "fred's",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "jim's",
"start_offset": 7,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "pete's",
"start_offset": 13,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "mark's",
"start_offset": 20,
"end_offset": 26,
"type": "<ALPHANUM>",
"position": 3
}
]
}
In the search, the search analyzer appears to do nothing. What am I missing?
TVMIA,
Adam (Editors - yes I know that saying "Thank you" is "fluff" but I wish to be polite so please leave it in.)
Upvotes: 0
Views: 80
Reputation: 2077
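There is a small mistake in your analyzer definition. With "type": "standard" you get the built-in standard analyzer, which does not take a char_filter, so char_filter_quotes is never applied at index or search time. To build your own analysis chain, specify the tokenizer instead. It should be
"tokenizer": "standard"
not
"type": "standard"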
Also, once you have indexed a document, you can check the terms that were actually indexed by using _termvectors. In your example, you can do a GET on
http://127.0.0.1:9200/esidx_json_gs_entry/_doc/1/_termvectors
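Putting the fix together, the analysis section might look like the sketch below. It reuses the names from the question (char_filter_quotes, analyzer_Text). The "type": "custom" line and the lowercase filter are my assumptions rather than part of the original settings: the built-in standard analyzer lowercases, and the _analyze request in the question used lowercase, so it is added here to keep matching case-insensitive.

```json
{
  "analysis": {
    "char_filter": {
      "char_filter_quotes": {
        "type": "mapping",
        "mappings": [
          "\\u0091=>\\u0027",
          "\\u0092=>\\u0027",
          "\\u2018=>\\u0027",
          "\\u2019=>\\u0027",
          "\\u201B=>\\u0027"
        ]
      }
    },
    "analyzer": {
      "analyzer_Text": {
        "type": "custom",
        "char_filter": [ "char_filter_quotes" ],
        "tokenizer": "standard",
        "filter": [ "lowercase" ]
      }
    }
  }
}
```

After recreating the index and reindexing the document, one way to confirm the change is to run _analyze against the index with the analyzer named, instead of spelling out the char_filter and tokenizer inline. The inline form in the question builds its own chain on the fly, which is probably why it looked correct even though analyzer_Text was not using the char filter. A request body along these lines, posted to http://127.0.0.1:9200/esidx_json_gs_entry/_analyze, should exercise the analyzer as the index actually defines it:

```json
{
  "analyzer": "analyzer_Text",
  "text": "Fred's Jim‘s Pete’s Mark‛s"
}
```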
Upvotes: 1