Adi Gabaie

Reputation: 146

Why are position, end_offset and start_offset messed up when using a self-made Tokenizer?

I wrote my own tokenizer: https://github.com/AdiGabaie/tokenizer

I create an analyzer with this tokenizer.

When I test the analyzer, I see the tokens, but "start_offset" and "end_offset" are 0 for every token, and "position" is 1 for all of them.

If I remove the 'autocomplete_filter', the positions are correct (1, 2, 3, ...), but 'start_offset' and 'end_offset' are still 0.

I guess there is something I need to do in my tokenizer implementation to fix this?

PUT /aditryings/
{
    "settings": {
        "index" : {
            "analysis" : { 
                "analyzer" : {
                    "my_analyzer" : {
                        "tokenizer" : "phrase_tokenizer",
                        "filter" : ["lowercase","autocomplete_filter"]
                    }
                },
                "filter" : {
                    "autocomplete_filter": {
                        "type": "edge_ngram",
                        "min_gram": 1,
                        "max_gram": 20
                    }
                }
            }
        }
    }, 
    "mappings" : {
        "productes" : {
            "properties" : {
                "id" : { "type" : "long"},
                "productName" : { "type" : "string", "index" : "analyzed", "analyzer": "my_analyzer"}
            }
        }
    }
}
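
For example, I test the analyzer with a request along these lines (the sample text is arbitrary):

GET /aditryings/_analyze?analyzer=my_analyzer&text=red+shoes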

Upvotes: 0

Views: 537

Answers (1)

Jaap

Reputation: 724

A tokenizer produces its output through the values of the attributes it registers. Your implementation adds only one:

protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);

This attribute carries the token text itself, but Elasticsearch expects more than the token: it also needs start_offset, end_offset and position. By adding an OffsetAttribute and setting its value for each token, you can report the correct start and end offsets:

https://lucene.apache.org/core/4_10_4/core/org/apache/lucene/analysis/tokenattributes/OffsetAttribute.html
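
For example, a minimal sketch (assuming your incrementToken() loop tracks the token's character range in the input as tokenStart and tokenEnd; those names are illustrative):

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
protected OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);

// Inside incrementToken(), after copying the token text into
// charTermAttribute, report its character range. correctOffset() maps
// the raw positions through any CharFilters in front of the tokenizer.
offsetAttribute.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));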

Similarly, PositionIncrementAttribute is used for setting the position:

https://lucene.apache.org/core/4_10_4/core/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.html

Its contract is described in the Javadoc; note that 0 is a valid increment, used for example when a word has multiple stems that should occupy the same position.
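
A minimal sketch of its use:

import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

protected PositionIncrementAttribute posIncrAttribute = addAttribute(PositionIncrementAttribute.class);

// Inside incrementToken(): a regular token advances the position by one.
posIncrAttribute.setPositionIncrement(1);
// An increment of 0 would instead stack the token on the previous
// position, e.g. for a synonym or an additional stem of the same word.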

For inspiration, you can take a look at the StandardTokenizer implementation, which uses all three attributes (as well as a token type attribute):

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
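
Putting it together, the body of incrementToken() typically follows this pattern (a sketch; findNextToken(), tokenText, tokenStart and tokenEnd stand in for your own scanning logic):

@Override
public boolean incrementToken() throws IOException {
    clearAttributes(); // reset all attributes before emitting the next token
    if (!findNextToken()) { // hypothetical: advance your scanner
        return false;      // no more tokens in the input
    }
    charTermAttribute.setEmpty().append(tokenText);
    offsetAttribute.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));
    posIncrAttribute.setPositionIncrement(1); // one position per token
    return true;
}

A complete implementation should also override end() to report the final offset once the stream is exhausted, as StandardTokenizer does.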

Upvotes: 0
