Reputation: 146
I wrote my own tokenizer: https://github.com/AdiGabaie/tokenizer
I created an analyzer with this tokenizer.
When I test the analyzer, I see the tokens, but "start_offset" and "end_offset" are 0 for all of them, and the position is 1 for all of them.
If I remove the 'autocomplete_filter', the position is correct (1, 2, 3, ...), but 'start_offset' and 'end_offset' are still 0.
I guess I should do something in my tokenizer implementation to fix this?
PUT /aditryings/
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "phrase_tokenizer",
            "filter": ["lowercase", "autocomplete_filter"]
          }
        },
        "filter": {
          "autocomplete_filter": {
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 20
          }
        }
      }
    }
  },
  "mappings": {
    "productes": {
      "properties": {
        "id": { "type": "long" },
        "productName": { "type": "string", "index": "analyzed", "analyzer": "my_analyzer" }
      }
    }
  }
}
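I test the analyzer with the _analyze API, roughly like this (the exact request form depends on the Elasticsearch version; the sample text is only a placeholder):

GET /aditryings/_analyze?analyzer=my_analyzer&text=some product name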
Upvotes: 0
Views: 537
Reputation: 724
The output of a tokenizer is expressed through the values of the attributes it adds, such as this one in your implementation:
protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
This is the only attribute used in your code, but Elasticsearch expects not just the attribute holding the token text; it also expects start_offset, end_offset and position. By adding an OffsetAttribute and setting its value, you can set the start and end offsets of each token correctly:
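For example (tokenStart and tokenEnd are placeholder names for wherever your code tracks the token's character boundaries in the input):

protected OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);

// inside incrementToken(), once the token's boundaries are known:
offsetAttribute.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));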
Similarly, PositionIncrementAttribute is used for setting the position:
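For example:

protected PositionIncrementAttribute positionAttribute = addAttribute(PositionIncrementAttribute.class);

// inside incrementToken(); 1 means "this token comes right after the previous one":
positionAttribute.setPositionIncrement(1);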
Its contract is described in the Javadoc; 0 is also a valid value, used for example when a word has multiple stems (the token then occupies the same position as the previous one).
For some inspiration you can take a look at Lucene's StandardTokenizer implementation, which uses all three types of attributes (as well as a token type attribute).
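As a rough sketch of the same pattern (this is not the StandardTokenizer, just a simplified whitespace-splitting tokenizer with made-up names), the attributes could be wired up like this:

import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Illustrative tokenizer that splits on whitespace and fills in all attributes.
public class SimpleWhitespaceTokenizer extends Tokenizer {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

    private int pos = 0; // number of characters consumed from the input so far

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();

        // skip whitespace before the next token
        int c = readChar();
        while (c != -1 && Character.isWhitespace(c)) {
            c = readChar();
        }
        if (c == -1) {
            return false; // end of input
        }

        int start = pos - 1; // offset of the token's first character
        StringBuilder term = new StringBuilder();
        while (c != -1 && !Character.isWhitespace(c)) {
            term.append((char) c);
            c = readChar();
        }

        termAtt.append(term);                                                             // token text
        offsetAtt.setOffset(correctOffset(start), correctOffset(start + term.length()));  // start_offset / end_offset
        posIncrAtt.setPositionIncrement(1);                                               // position advances by one per token
        typeAtt.setType(TypeAttribute.DEFAULT_TYPE);                                      // "word"
        return true;
    }

    // Reads one character and keeps track of how far we are in the input.
    private int readChar() throws IOException {
        int c = input.read();
        if (c != -1) {
            pos++;
        }
        return c;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pos = 0;
    }

    @Override
    public void end() throws IOException {
        super.end();
        int finalOffset = correctOffset(pos);
        offsetAtt.setOffset(finalOffset, finalOffset);
    }
}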
Upvotes: 0