Reactormonk

Reputation: 21700

Use pretokenized text with Lucene

My data has already been tokenized by an external tool, and I'd like to use those tokens in Lucene. My first idea is to join the tokens with \x01 and use a WhitespaceTokenizer to split them again. Is there a better way? (The input is XML.)

As a bonus, the annotated data also contains synonyms (represented as XML tags). How would I inject them?

Upvotes: 0

Views: 258

Answers (2)

Alex Nevidomsky

Reputation: 698

Lucene lets you supply your own stream of tokens for a field, bypassing the tokenization step. To do that, create a subclass of TokenStream that implements incrementToken(), then call field.setTokenStream(new MyTokenStream(yourTokens)):

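import java.io.IOException;
import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// MyToken is your own token type; it is expected to expose
// getText(), begin() and end().
public class MyTokenStream extends TokenStream {
    private final CharTermAttribute charTermAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final Iterator<MyToken> listOfTokens;

    MyTokenStream(Iterator<MyToken> tokenList) {
        listOfTokens = tokenList;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!listOfTokens.hasNext()) {
            return false; // no more tokens to emit
        }
        // Reset attribute state left over from the previous token.
        clearAttributes();
        MyToken myToken = listOfTokens.next();
        // Copy the token text and its character offsets into the attributes.
        charTermAtt.setEmpty().append(myToken.getText());
        offsetAtt.setOffset(myToken.begin(), myToken.end());
        return true;
    }
}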

Upvotes: 3

Fred Foo

Reputation: 363567

WhitespaceTokenizer is unfit for strings joined with 0x01, because that character does not count as whitespace. Instead, derive from CharTokenizer and override isTokenChar.
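To make the separator idea concrete, here is a minimal sketch in plain Java, without any Lucene classes. SeparatorSplit and its split method are hypothetical names used only for illustration; the isTokenChar predicate has the shape you would put in the overridden isTokenChar of a CharTokenizer subclass (its parameter is an int code point in recent Lucene versions, so check the signature for the version you use).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: it mimics how a CharTokenizer
// groups consecutive token characters into tokens.
class SeparatorSplit {
    static final char SEPARATOR = '\u0001';

    // Same contract as CharTokenizer.isTokenChar: true for characters that
    // belong to a token, false for characters that separate tokens.
    static boolean isTokenChar(int c) {
        return c != SEPARATOR;
    }

    static List<String> split(String joined) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < joined.length(); ) {
            int c = joined.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                // A separator ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Unlike whitespace splitting, this keeps tokens that themselves contain spaces (for example "New York") in one piece, since only the 0x01 separator ends a token.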

The main problem with this approach is that joining and then splitting again might be expensive; if it turns out to be too expensive, you can implement a trivial TokenStream that just emits the tokens from its input.

If by synonyms you mean that a term like "programmer" is expanded to a set of terms, say, {"programmer", "developer", "hacker"}, then I recommend emitting these at the same position. You can use a PositionIncrementAttribute to control this.

For an example of PositionIncrementAttribute usage, see my lemmatizing TokenStream which emits both word forms found in full text and their lemmas at the same position.
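As a rough sketch of the position logic only, again in plain Java without Lucene classes: the first term of each group advances the position by 1, and every synonym that follows gets an increment of 0 so that it lands on the same position. SynonymPositions, PositionedTerm and flatten are hypothetical names; inside a real TokenStream, the computed value would be passed to PositionIncrementAttribute.setPositionIncrement in incrementToken().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class SynonymPositions {
    // A term plus the increment that would be handed to
    // PositionIncrementAttribute.setPositionIncrement.
    static final class PositionedTerm {
        final String term;
        final int positionIncrement;

        PositionedTerm(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Each inner list holds one original token followed by its synonyms.
    static List<PositionedTerm> flatten(List<List<String>> groups) {
        List<PositionedTerm> out = new ArrayList<>();
        for (List<String> group : groups) {
            boolean first = true;
            for (String term : group) {
                // First term moves to the next position; synonyms stay on it.
                out.add(new PositionedTerm(term, first ? 1 : 0));
                first = false;
            }
        }
        return out;
    }
}
```

For the groups {"senior"} and {"programmer", "developer", "hacker"}, this yields the increments 1, 1, 0, 0, so all three synonyms share the second position and a phrase query such as "senior developer" can still match.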

Upvotes: 0
