Reputation: 21700
My data is already tokenized by an external resource, and I'd like to use that data within Lucene. My first idea would be to join those strings with a \x01 and use a WhitespaceTokenizer to split them again. Is there a better idea? (The input is in XML.)
As a bonus, this annotated data also contains synonyms (represented as XML tags). How would I inject them?
Upvotes: 0
Views: 258
Reputation: 698
Lucene allows you to provide your own stream of tokens to a field, bypassing the tokenization step. To do that, create your own subclass of TokenStream that implements incrementToken(), and then call field.setTokenStream(new MyTokenStream(yourTokens)):
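import java.io.IOException;
import java.util.Iterator;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// MyToken is your own type holding the token text and its character offsets.
// Lucene expects a TokenStream subclass (or its incrementToken()) to be final.
public final class MyTokenStream extends TokenStream {
    private final CharTermAttribute charTermAtt;
    private final OffsetAttribute offsetAtt;
    private final Iterator<MyToken> listOfTokens;

    MyTokenStream(Iterator<MyToken> tokenList) {
        listOfTokens = tokenList;
        charTermAtt = addAttribute(CharTermAttribute.class);
        offsetAtt = addAttribute(OffsetAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (listOfTokens.hasNext()) {
            // Reset all attributes before filling in the next token.
            clearAttributes();
            MyToken myToken = listOfTokens.next();
            charTermAtt.setLength(0);
            charTermAtt.append(myToken.getText());
            offsetAtt.setOffset(myToken.begin(), myToken.end());
            return true;
        }
        // No tokens left: signal the end of the stream.
        return false;
    }
}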
Upvotes: 3
Reputation: 363567
WhitespaceTokenizer is unfit for strings joined with 0x01, because 0x01 is not a whitespace character. Instead, derive from CharTokenizer, overriding isTokenChar.
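To illustrate the rule such an isTokenChar override would encode, here is a minimal plain-Java sketch that does not use Lucene at all. The class name SeparatorSplitDemo and the split helper are made up for this illustration; in a real CharTokenizer subclass only the isTokenChar predicate would be yours (in recent Lucene versions it takes an int code point), and Lucene would do the splitting.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the splitting rule; it does not use Lucene.
class SeparatorSplitDemo {
    // Assumed separator: the 0x01 control character from the question.
    static final int SEPARATOR = 0x01;

    // Plays the role of CharTokenizer.isTokenChar: true means "part of a token".
    static boolean isTokenChar(int c) {
        return c != SEPARATOR;
    }

    // Collects maximal runs of token characters, skipping empty tokens.
    static List<String> split(String joined) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < joined.length(); ) {
            int c = joined.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // 0x01 is not whitespace, so a whitespace-based split ignores it.
        System.out.println(Character.isWhitespace(SEPARATOR)); // prints false
        // Tokens may contain spaces because only 0x01 ends a token.
        System.out.println(split("New York\u0001is\u0001big")); // prints [New York, is, big]
    }
}
```

Note that with this rule a token such as "New York" survives intact, which a whitespace-based split could not guarantee.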
The main problem with this approach is that joining and then splitting again might be expensive; if it turns out to be too expensive, you can implement a trivial TokenStream that just emits the tokens from its input.
If by synonyms you mean that a term like "programmer" is expanded to a set of terms, say, {"programmer", "developer", "hacker"}, then I recommend emitting these at the same position. You can use a PositionIncrementAttribute to control this: give the first term its normal increment and each additional synonym a position increment of 0.
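As a rough sketch of the bookkeeping involved, the following plain-Java model (no Lucene involved) assigns position increments to a token and its synonyms and derives the resulting positions. The names SynonymPositionsDemo, Emitted, expand and positions are hypothetical; in a real TokenStream the increment would be passed to PositionIncrementAttribute.setPositionIncrement instead of being stored in a list.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of position increments; it does not use Lucene.
class SynonymPositionsDemo {
    // Hypothetical holder for one emitted term and its position increment.
    static final class Emitted {
        final String term;
        final int positionIncrement;

        Emitted(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Each inner list is one input token followed by its synonyms.
    static List<Emitted> expand(List<List<String>> groups) {
        List<Emitted> out = new ArrayList<>();
        for (List<String> group : groups) {
            for (int i = 0; i < group.size(); i++) {
                // The first term advances the position; synonyms stay on it.
                out.add(new Emitted(group.get(i), i == 0 ? 1 : 0));
            }
        }
        return out;
    }

    // Absolute positions are the running sum of increments, starting
    // from -1 so that the first term lands on position 0.
    static List<Integer> positions(List<Emitted> emitted) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (Emitted e : emitted) {
            position += e.positionIncrement;
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Emitted> emitted = expand(List.of(
                List.of("the"),
                List.of("programmer", "developer", "hacker"),
                List.of("codes")));
        System.out.println(positions(emitted)); // prints [0, 1, 1, 1, 2]
    }
}
```

Because the three synonyms share position 1, a phrase query such as "the developer codes" would match the same text as "the programmer codes".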
For an example of PositionIncrementAttribute usage, see my lemmatizing TokenStream, which emits both the word forms found in the full text and their lemmas at the same position.
Upvotes: 0