Siddharth Sinha

Reputation: 628

How to get the postings list for each term in a Lucene index

I am reading a Lucene index and I am able to retrieve the terms from it. I want to get the postings list for each term in the index. I am using the Lucene 7.4.0 jar. Each document in this index consists of two fields: (1) text_es, text_fr, or text_en, and (2) DocId. Below is the code.

    import java.io.IOException;
    import java.nio.file.Paths;
    import java.util.LinkedList;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;

    public class LuceneTest {

        public static void main(String[] args) {
            final String INDEX_DIRECTORY = "./index";
            Directory index;
            try {
                index = FSDirectory.open(Paths.get(INDEX_DIRECTORY));
                IndexReader indexReader = DirectoryReader.open(index);

                // One leaf (segment) per language
                LeafReaderContext leafReaderContext_es = indexReader.leaves().get(0);
                LeafReaderContext leafReaderContext_fr = indexReader.leaves().get(1);
                LeafReaderContext leafReaderContext_en = indexReader.leaves().get(2);

                LinkedList<String> terms_es = new LinkedList<>();
                LinkedList<String> terms_en = new LinkedList<>();
                LinkedList<String> terms_fr = new LinkedList<>();

                LeafReader ir_es = leafReaderContext_es.reader();
                LeafReader ir_fr = leafReaderContext_fr.reader();
                LeafReader ir_en = leafReaderContext_en.reader();

                // Collect every term of the Spanish field
                TermsEnum terms = ir_es.terms("text_es").iterator();
                BytesRef next = terms.next();
                while (next != null) {
                    terms_es.add(terms.term().utf8ToString());
                    next = terms.next();
                }

                // Collect every term of the French field
                TermsEnum termsEnum_fr = ir_fr.terms("text_fr").iterator();
                BytesRef next_fr = termsEnum_fr.next();
                while (next_fr != null) {
                    terms_fr.add(termsEnum_fr.term().utf8ToString());
                    next_fr = termsEnum_fr.next();
                }

                // Collect every term of the English field
                TermsEnum termsEnum_en = ir_en.terms("text_en").iterator();
                BytesRef next_en = termsEnum_en.next();
                while (next_en != null) {
                    terms_en.add(termsEnum_en.term().utf8ToString());
                    next_en = termsEnum_en.next();
                }

                System.out.println("Spanish terms are as follows:");
                System.out.println(terms_es);

                System.out.println("French terms are as follows:");
                System.out.println(terms_fr);

                System.out.println("English terms are as follows:");
                System.out.println(terms_en);

            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

I went through the Lucene 7.4.0 documentation and came across the method postings(Term term), which returns a PostingsEnum for the specified term with PostingsEnum.FREQS. The problem is that this method accepts a parameter of class Term, but what I have is a TermsEnum. How can I convert it to a Term so that I can use the postings method to retrieve the postings list for each term?

Thanks.

Upvotes: 3

Views: 1358

Answers (1)

NiYanchun

Reputation: 793

I use Lucene 8.2; you may try the code below:

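    // indexDir is an already opened org.apache.lucene.store.Directory.
    IndexReader indexReader = DirectoryReader.open(indexDir);
    // Term vector of document 0 for the "content" field; this is null
    // unless the field was indexed with term vectors enabled.
    Terms termVector = indexReader.getTermVector(0, "content");
    if (termVector == null) {
        throw new IllegalStateException("no term vector for doc 0, field \"content\"");
    }
    TermsEnum termIter = termVector.iterator();
    while (termIter.next() != null) {
        // postings() is called on the TermsEnum itself, so no Term object is needed.
        PostingsEnum postingsEnum = termIter.postings(null, PostingsEnum.ALL);
        while (postingsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            int freq = postingsEnum.freq();
            System.out.printf("term: %s, freq: %d,", termIter.term().utf8ToString(), freq);
            // Call nextPosition() exactly freq times, once per occurrence.
            while (freq > 0) {
                System.out.printf(" nextPosition: %d,", postingsEnum.nextPosition());
                // Offsets are -1 if they were not indexed.
                System.out.printf(" startOffset: %d, endOffset: %d",
                        postingsEnum.startOffset(), postingsEnum.endOffset());
                freq--;
            }
            System.out.println();
        }
    }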
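
To make it clearer what the nested loops above are walking through, here is a toy inverted index in plain Java. It is only a sketch and not Lucene code: `ToyPostings`, `Posting`, and `build` are made-up names for illustration, and splitting on whitespace stands in for a real analyzer. Each term maps to a postings list, and each posting carries a document ID plus the positions of the term in that document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of an inverted index (hypothetical, not part of Lucene). */
final class ToyPostings {

    /** One posting: a document ID and the token positions of the term in that document. */
    static final class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>();

        Posting(int docId) {
            this.docId = docId;
        }

        /** Term frequency within this document. */
        int freq() {
            return positions.size();
        }
    }

    /** Builds term -> postings list; terms are sorted, postings are ordered by document ID. */
    static Map<String, List<Posting>> build(List<String> docs) {
        Map<String, List<Posting>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Assumption: whitespace tokenization plus lowercasing replaces a real analyzer.
            String[] tokens = docs.get(docId).trim().toLowerCase(Locale.ROOT).split("\\s+");
            for (int pos = 0; pos < tokens.length; pos++) {
                if (tokens[pos].isEmpty()) {
                    continue; // an empty document yields a single empty token
                }
                List<Posting> postings = index.computeIfAbsent(tokens[pos], t -> new ArrayList<>());
                // Documents are visited in increasing order, so only the last posting
                // can belong to the current document.
                if (postings.isEmpty() || postings.get(postings.size() - 1).docId != docId) {
                    postings.add(new Posting(docId));
                }
                postings.get(postings.size() - 1).positions.add(pos);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Posting>> index =
                build(Arrays.asList("le chat noir le chat", "the black cat", "le cat"));
        // Same shape as the Lucene loops: for each term, for each document, for each position.
        for (Map.Entry<String, List<Posting>> entry : index.entrySet()) {
            for (Posting p : entry.getValue()) {
                System.out.printf("term: %s, doc: %d, freq: %d, positions: %s%n",
                        entry.getKey(), p.docId, p.freq(), p.positions);
            }
        }
    }
}
```

In Lucene terms, the outer map iteration plays the role of `TermsEnum.next()`, each `Posting` corresponds to one step of `PostingsEnum.nextDoc()`, `freq()` to `PostingsEnum.freq()`, and each stored position to one call of `nextPosition()`.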

Upvotes: 2
