Young

Reputation: 81

How to get the term frequency of a specific term in each document with Lucene 5.3?

Imagine there are three documents:

Doc1: Hi everyone, I am Li. Hi.
Doc2: Well done boy.
Doc3: Hi, boy. I am Young.

I am trying to get the frequency of each term in each document with Lucene 5.3.

The result I want looks like this: Doc1: Hi 2, everyone 1, I 1, am 1, Li 1

1   IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(iNDEX_DIR2).toPath()));

2   int num_doc = reader.numDocs();
3   for(int docNum=0; docNum<num_doc; docNum++){
4       try{

5           Document doc = reader.document(docNum);
6           System.out.println("Processing file:"+doc.get("filename"));
7           
8           Terms termVector = reader.getTermVector(docNum, "contents");
9           TermsEnum itr = termVector.iterator();
10          BytesRef term = null;
11                          
12          while((term = itr.next()) != null){
13              try{
14                  String termText = term.utf8ToString();
15                  Term termInstance = new Term("contents",term);      
16                  long termFreq = reader.totalTermFreq(termInstance);
17                  long docCount = reader.docFreq(termInstance);
18                  
19                  System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);
20              }catch(Exception e){
21                  System.out.println(e);
22              }
23          }

With this code I only get the total frequency of each term across the whole index, not its frequency in each document. Could anyone help me?

Thanks!

Upvotes: 3

Views: 1966

Answers (1)

Karsten R.

Reputation: 1758

Use PostingsEnum.freq().

In your case the index stores term vectors, so use the following (it goes after line 7 of your code):

8           Terms termVector = reader.getTermVector(docNum, "contents");
9           TermsEnum itr = termVector.iterator();
10          BytesRef term = null;
11          PostingsEnum postings = null;
12          while((term = itr.next()) != null){
13              try{
14                  String termText = term.utf8ToString();
15                  postings = itr.postings(postings, PostingsEnum.FREQS);
16                  postings.nextDoc(); // a term vector holds one document; position on it before calling freq()
17                  int freq = postings.freq();
18
19                  System.out.println("doc:" + docNum + ", term: " + termText + ", termFreq = " + freq);
20              } catch(Exception e){
21                  System.out.println(e);
22              }
23          }

(If you need the frequency in every document, be aware that you can obtain a PostingsEnum without term vectors.)
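To make the difference between the three numbers concrete, here is a minimal Lucene-free sketch run on the three sample documents from the question. It only illustrates what each count means: the per-document count is what PostingsEnum.freq() reports, while the index-wide count and the document count correspond to reader.totalTermFreq(...) and reader.docFreq(...). The class name, the method names, and the tokenizer (lowercase, split on anything that is not a letter or digit) are assumptions made for this illustration; a real Lucene analyzer may tokenize differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical demo class: plain Java, no Lucene involved.
class TermFrequencyDemo {

    // Per-document frequency of every term: the number PostingsEnum.freq() gives per document.
    // Assumption: tokens are lowercased runs of letters or digits.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Occurrences of the term summed over all documents: what totalTermFreq corresponds to.
    static long totalTermFreq(List<String> docs, String term) {
        long total = 0;
        for (String doc : docs) {
            total += termFrequencies(doc).getOrDefault(term, 0);
        }
        return total;
    }

    // Number of documents that contain the term at least once: what docFreq corresponds to.
    static int docFreq(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            if (termFrequencies(doc).containsKey(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Hi everyone, I am Li. Hi.",
                "Well done boy.",
                "Hi, boy. I am Young.");
        for (int i = 0; i < docs.size(); i++) {
            System.out.println("Doc" + (i + 1) + ": " + termFrequencies(docs.get(i)));
        }
        System.out.println("totalTermFreq(hi) = " + totalTermFreq(docs, "hi")
                + ", docFreq(hi) = " + docFreq(docs, "hi"));
    }
}
```

With this tokenizer, "hi" occurs twice in Doc1 and once in Doc3, so its index-wide total is 3 and its document count is 2. The code in the question printed those two index-wide numbers for every term, which is why the per-document value never appeared.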

Upvotes: 2
