Use pretokenized text with Lucene

Question

My data is already tokenized by an external tool, and I'd like to use it in Lucene. My first idea is to join the tokens with \x01 and use a WhitespaceTokenizer to split them again. Is there a better way? (The input is XML.)

As a bonus, the annotated data also contains synonyms (represented as XML tags). How would I inject them?

Fred Foo · Accepted Answer

WhitespaceTokenizer is unfit for strings joined with 0x01, since that character does not count as whitespace. Instead, derive from CharTokenizer and override isTokenChar.
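A quick way to see why: as far as I know, WhitespaceTokenizer splits wherever Character.isWhitespace returns true, and U+0001 does not qualify. The plain-Java sketch below runs without Lucene on the classpath. SeparatorCheck and its methods are made-up names for illustration, not Lucene API. It checks the whitespace claim and shows the rule that an isTokenChar override would encode.

```java
import java.util.ArrayList;
import java.util.List;

// Stdlib-only sketch. SeparatorCheck, isTokenChar and split are hypothetical
// names used for illustration; they are not part of Lucene.
final class SeparatorCheck {
    // The separator from the question (code point 0x01).
    static final char SEP = (char) 0x01;

    // The rule an isTokenChar(int) override would apply: every code point
    // except the separator belongs to a token.
    static boolean isTokenChar(int c) {
        return c != SEP;
    }

    // Splits text the way a character-predicate tokenizer would: maximal runs
    // of token characters become tokens, separators are dropped.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String joined = String.join(String.valueOf(SEP), "New York", "is", "big");
        System.out.println(Character.isWhitespace(SEP)); // false
        System.out.println(split(joined));               // [New York, is, big]
    }
}
```

Note that a token such as "New York" keeps its internal space, which is the point of choosing a separator that cannot occur in the data.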

The main problem with this approach is that joining and then splitting again might be expensive; if it turns out to be too expensive, you can implement a trivial TokenStream that just emits the tokens from its input.

If by synonyms you mean that a term like "programmer" is expanded to a set of terms, say, {"programmer", "developer", "hacker"}, then I recommend emitting these at the same position. You can use a PositionIncrementAttribute to control this: a position increment of 0 places a token at the same position as the previous one.
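To make the position bookkeeping concrete, here is a minimal plain-Java model of it. It does not use Lucene classes: SynonymPositions, Emitted, emit and positions are hypothetical names, and the "senior programmer wanted" input is invented for the example. The assumption is that the first term of each group advances by one position and every synonym gets an increment of 0.

```java
import java.util.ArrayList;
import java.util.List;

// Stdlib-only model of position increments. All types here are hypothetical
// and only mimic the bookkeeping; they are not Lucene classes.
final class SynonymPositions {
    // One emitted token: its text and its increment relative to the previous token.
    static final class Emitted {
        final String term;
        final int positionIncrement;

        Emitted(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Each inner list holds an original token followed by its synonyms.
    static List<Emitted> emit(List<List<String>> groups) {
        List<Emitted> out = new ArrayList<>();
        for (List<String> group : groups) {
            for (int i = 0; i < group.size(); i++) {
                // The first term advances one position; synonyms stay put (increment 0).
                out.add(new Emitted(group.get(i), i == 0 ? 1 : 0));
            }
        }
        return out;
    }

    // Resolves increments to absolute positions; the first token lands on 0.
    static List<Integer> positions(List<Emitted> tokens) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Emitted t : tokens) {
            pos += t.positionIncrement;
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Emitted> tokens = emit(List.of(
                List.of("senior"),
                List.of("programmer", "developer", "hacker"),
                List.of("wanted")));
        System.out.println(positions(tokens)); // [0, 1, 1, 1, 2]
    }
}
```

In a real TokenStream the same effect comes from calling setPositionIncrement(0) on the PositionIncrementAttribute before returning each synonym from incrementToken(). Because all three terms share position 1, a phrase query for "senior developer" should match just as "senior programmer" does.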

For an example of PositionIncrementAttribute usage, see my lemmatizing TokenStream which emits both word forms found in full text and their lemmas at the same position.
