Reputation: 43
I am new to Lucene and I would really appreciate an example of how to get bigram and trigram tokens into the index.
I'm using the following code, which I have modified to calculate term frequencies and weights, but I need to do the same for bigrams and trigrams as well. I can't see where the tokenization happens! I searched online, and some of the suggested classes no longer exist in Lucene 3.4.0 because they have been deprecated.
Any suggestions please?
Thanks, Moe
EDIT: --------------------------------
Now I'm using the NGramTokenFilter as mbonaci suggested. This is the part of the code where I tokenize a text to get the uni-, bi-, and trigrams. But the n-grams are built at the character level rather than the word level.
Instead of:
[H][e][l][l][o][HE][EL]
etc.
I'm looking for: [Hello][World][Hello World]
int min = 1;
int max = 3;
WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_34);
String text = "hello my world";
TokenStream tokenStream = analyzer.tokenStream("Data", new StringReader(text));
// NGramTokenFilter builds n-grams from the characters of each token,
// which is why I get [H][e][HE]... instead of word pairs
NGramTokenFilter myfilter = new NGramTokenFilter(tokenStream, min, max);
OffsetAttribute offsetAttribute2 = myfilter.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute2 = myfilter.addAttribute(CharTermAttribute.class);
while (myfilter.incrementToken()) {
    int startOffset = offsetAttribute2.startOffset();
    int endOffset = offsetAttribute2.endOffset();
    String term = charTermAttribute2.toString();
    System.out.println(term);
}
Upvotes: 4
Views: 5374
Reputation: 15771
You need to look at shingles — a shingle is a word-level n-gram, which is exactly what you're after. That article shows how to do it.
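As a rough sketch (untested against 3.4.0, so treat the exact constructor signatures as an assumption), you can swap your NGramTokenFilter for org.apache.lucene.analysis.shingle.ShingleFilter to get word-level uni-, bi-, and trigrams from whitespace-separated tokens:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShingleDemo {
    public static void main(String[] args) throws Exception {
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_34);
        TokenStream stream = analyzer.tokenStream(
                "Data", new StringReader("hello my world"));

        // ShingleFilter combines adjacent *tokens* (words) into shingles.
        // A max shingle size of 3 yields bigrams and trigrams.
        ShingleFilter shingles = new ShingleFilter(stream, 3);
        shingles.setOutputUnigrams(true); // also emit the single words

        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
        while (shingles.incrementToken()) {
            // emits tokens like: hello, "hello my", "hello my world", my, ...
            System.out.println(term.toString());
        }
    }
}
```

The key difference from your snippet is only the filter: NGramTokenFilter slides a window over the characters of each token, while ShingleFilter slides it over the token stream itself.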
Upvotes: 1
Reputation: 5708
Take a look at org.apache.lucene.analysis.ngram.NGramTokenFilter.
Here is the source.
Upvotes: 0