Reputation: 81
Imagine there are three documents:
Doc1: Hi everyone, I am Li. Hi.
Doc2: Well done boy.
Doc3: Hi, boy. I am Young.
I am trying to get the frequency of each term within each document using Lucene 5.3.
The result I want looks like this: Doc1: Hi 2, everyone 1, I 1, am 1, Li 1
1  IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(iNDEX_DIR2).toPath()));
2  int num_doc = reader.numDocs();
3  for(int docNum=0; docNum<num_doc; docNum++){
4      try{
5          Document doc = reader.document(docNum);
6          System.out.println("Processing file:"+doc.get("filename"));
7
8          Terms termVector = reader.getTermVector(docNum, "contents");
9          TermsEnum itr = termVector.iterator();
10         BytesRef term = null;
11
12         while((term = itr.next()) != null){
13             try{
14                 String termText = term.utf8ToString();
15                 Term termInstance = new Term("contents",term);
16                 long termFreq = reader.totalTermFreq(termInstance);
17                 long docCount = reader.docFreq(termInstance);
18
19                 System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);
20             }catch(Exception e){
21                 System.out.println(e);
22             }
23         }
24     }catch(Exception e){
25         System.out.println(e);
26     }
27 }
With this code I only get the total frequency of each term across the whole index, not its frequency within each document. Could anyone help me?
Thanks!
Upvotes: 3
Views: 1966
Reputation: 1758
Use PostingsEnum.freq(), which returns the frequency of the term in the current document.
In your case the index stores term vectors, so use the following (after line 7):
8          Terms termVector = reader.getTermVector(docNum, "contents");
9          TermsEnum itr = termVector.iterator();
10         BytesRef term = null;
11         PostingsEnum postings = null;
12         while((term = itr.next()) != null){
13             try{
14                 String termText = term.utf8ToString();
15                 postings = itr.postings(postings, PostingsEnum.FREQS);
16                 postings.nextDoc(); // a term vector covers a single document; position on it before calling freq()
17                 int freq = postings.freq(); // occurrences of this term in document docNum
18
19                 System.out.println("doc:" + docNum + ", term: " + termText + ", termFreq = " + freq);
20             } catch(Exception e){
21                 System.out.println(e);
22             }
23         }
(If you need the frequency across all documents, be aware that you can obtain a PostingsEnum without term vectors.)
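To make the difference between the three numbers concrete, here is a plain-Java sketch that does not use Lucene at all. It recomputes, for the three example documents from the question, the per-document frequency (the value PostingsEnum.freq() reports), the index-wide total (analogous to totalTermFreq) and the number of documents containing the term (analogous to docFreq). The class TermFreqDemo, its method names and the tokenization (lowercase, split on anything that is not a letter or digit) are assumptions made for illustration; a real analyzer may tokenize differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, not part of Lucene.
class TermFreqDemo {

    // Assumed tokenization: lowercase, then split on non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Frequency of every term inside ONE document (the per-document value).
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String t : tokenize(text)) {
            freqs.merge(t, 1, Integer::sum);
        }
        return freqs;
    }

    // Occurrences of the term summed over ALL documents (analogous to totalTermFreq).
    static long totalTermFreq(List<String> docs, String term) {
        long total = 0;
        for (String d : docs) {
            total += termFreqs(d).getOrDefault(term, 0);
        }
        return total;
    }

    // Number of documents that contain the term at least once (analogous to docFreq).
    static int docFreq(List<String> docs, String term) {
        int count = 0;
        for (String d : docs) {
            if (termFreqs(d).containsKey(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Hi everyone, I am Li. Hi.",
                "Well done boy.",
                "Hi, boy. I am Young.");
        for (int i = 0; i < docs.size(); i++) {
            System.out.println("Doc" + (i + 1) + ": " + termFreqs(docs.get(i)));
        }
        System.out.println("hi: totalTermFreq = " + totalTermFreq(docs, "hi")
                + ", docFreq = " + docFreq(docs, "hi"));
    }
}
```

For the term "hi", the per-document frequency in Doc1 is 2, while the total over all three documents is 3 and the document count is 2. The code in the question prints the last two values for every document, which is why the output never changes from one document to the next.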
Upvotes: 2