Reputation: 43
I am new to Lucene and I would really appreciate an example of how to get bigram and trigram tokens into the index.
I'm using the following code, which I have modified to calculate term frequencies and weights, but I need to do the same for bigrams and trigrams as well. I can't see where the tokenization happens! I searched online, and some of the suggested classes no longer exist in Lucene 3.4.0 because they have been deprecated.
Any suggestions please?
Thanks, Moe
EDIT: --------------------------------
Now I'm using the NGramTokenFilter as mbonaci suggested. This is the part of the code where I tokenize a text to get the uni-, bi-, and trigrams. But the n-grams are built at the character level rather than the word level.
Instead of:
[H][e][l][l][o][HE][EL]
etc.
I'm looking for: [Hello][World][Hello World]
int min = 1;
int max = 3;
WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_34);
String text = "hello my world";
TokenStream tokenStream = analyzer.tokenStream("Data", new StringReader(text));
// NGramTokenFilter builds n-grams from the characters of each token,
// which is why I get [H][e][HE]... instead of word pairs
NGramTokenFilter myfilter = new NGramTokenFilter(tokenStream, min, max);
OffsetAttribute offsetAttribute2 = myfilter.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute2 = myfilter.addAttribute(CharTermAttribute.class);
while (myfilter.incrementToken()) {
    int startOffset = offsetAttribute2.startOffset();
    int endOffset = offsetAttribute2.endOffset();
    String term = charTermAttribute2.toString();
    System.out.println(term);
}
Upvotes: 4
Views: 5374
Reputation: 15771
You need to look at shingles — a shingle is a word-level n-gram, which is exactly what you're after. That article shows how to do it.
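As a rough sketch (untested against 3.4.0, so treat the exact constructor signatures as an assumption), you can swap your NGramTokenFilter for org.apache.lucene.analysis.shingle.ShingleFilter to get word-level uni-, bi-, and trigrams from whitespace-separated tokens:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShingleDemo {
    public static void main(String[] args) throws Exception {
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_34);
        TokenStream stream = analyzer.tokenStream(
                "Data", new StringReader("hello my world"));

        // ShingleFilter combines adjacent *tokens* (words) into shingles.
        // A max shingle size of 3 yields bigrams and trigrams.
        ShingleFilter shingles = new ShingleFilter(stream, 3);
        shingles.setOutputUnigrams(true); // also emit the single words

        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
        while (shingles.incrementToken()) {
            // emits tokens like: hello, "hello my", "hello my world", my, ...
            System.out.println(term.toString());
        }
    }
}
```

The key difference from your snippet is only the filter: NGramTokenFilter slides a window over the characters of each token, while ShingleFilter slides it over the token stream itself.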
Upvotes: 1
Reputation: 5708
Take a look at org.apache.lucene.analysis.ngram.NGramTokenFilter.
Here is the source.
Upvotes: 0