Reputation: 11
I can't get this to work with Lucene 4.0 and its new features. Could somebody please help me?
I have crawled a bunch of HTML documents from the web. Now I would like to count the number of distinct words in each document.
This is how I did it with Lucene 3.5 for a single document. (To process them all, I loop over the documents, each time with a new RAMDirectory containing only one document.)
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35); // for example; any other Lucene Analyzer works too
RAMDirectory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
// Get the String containing the text to analyze:
String _words = doc.getPageDescription();
try {
IndexWriter w = new IndexWriter(index, config);
addDoc(w, _words);
w.close();
} catch (IOException e) {
e.printStackTrace();
} catch (Exception e) {
e.printStackTrace();
}
try {
// System.out.print(", count Terms... ");
IndexReader reader = IndexReader.open(index);
TermFreqVector[] freqVector = reader.getTermFreqVectors(0);
if (freqVector == null) {
System.out.println("Count words: 0");
freqVector = new TermFreqVector[0]; // nothing to iterate over below
}
for (TermFreqVector vector : freqVector) {
String[] terms = vector.getTerms();
int[] freq = vector.getTermFrequencies();
int n = terms.length;
System.out.println("Count words: " + n);
....
How can I do this with Lucene 4.0?
I'd prefer to use an FSDirectory instead of a RAMDirectory, though; I guess that performs better when there is a large number of documents?
Thanks and regards, C.
Upvotes: 1
Views: 3584
Reputation: 3195
Use the Fields/Terms APIs.
See especially the example 'access term vector fields for a specific document'.
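A minimal sketch of that term-vector approach, assuming Lucene 4.0 with lucene-core and lucene-analyzers-common on the classpath. The class name DistinctTermCounter, the field name "content" and the index path "term-count-index" are made up for illustration. It also assumes the field is indexed with term vectors enabled and that the index has no deleted documents; without term vectors, getTermVector() returns null.

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class DistinctTermCounter {
    // Hypothetical field name chosen for this sketch.
    static final String FIELD = "content";

    /** Counts the distinct terms of FIELD in one document; 0 if it has no term vector. */
    static long countDistinctTerms(IndexReader reader, int docId) throws IOException {
        Terms vector = reader.getTermVector(docId, FIELD);
        if (vector == null) {
            return 0;
        }
        long count = 0;
        // Lucene 4.x signature: the argument is an optional TermsEnum to reuse.
        TermsEnum termsEnum = vector.iterator(null);
        while (termsEnum.next() != null) {
            count++;
        }
        return count;
    }

    /** Indexes each text as one document in an on-disk index, with term vectors stored. */
    static Directory buildIndex(File path, String... texts) throws IOException {
        Directory dir = FSDirectory.open(path);
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE); // overwrite any old index
        IndexWriter writer = new IndexWriter(dir, config);

        FieldType type = new FieldType(TextField.TYPE_NOT_STORED);
        type.setStoreTermVectors(true); // required for getTermVector()

        for (String text : texts) {
            Document doc = new Document();
            doc.add(new Field(FIELD, text, type));
            writer.addDocument(doc);
        }
        writer.close();
        return dir;
    }

    public static void main(String[] args) throws IOException {
        Directory dir = buildIndex(new File("term-count-index"),
                "quick brown fox", "fox fox fox");
        IndexReader reader = DirectoryReader.open(dir);
        // All documents live in one index; no per-document directory is needed.
        for (int docId = 0; docId < reader.maxDoc(); docId++) {
            System.out.println("Doc " + docId + " count words: "
                    + countDistinctTerms(reader, docId));
        }
        reader.close();
    }
}
```

Because every document sits in the same FSDirectory, the index is built once and then read per document, instead of creating a fresh RAMDirectory for each one.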
Since you are looping over all documents: if your end goal is really something like the average number of unique terms across all documents, keep reading to the 'index statistics' section. In that case you can compute it efficiently as #postings / #documents: getSumDocFreq()/maxDoc()
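To see why #postings / #documents gives the average number of unique terms, here is a small plain-Java illustration that does not use Lucene at all. The class AverageUniqueTerms and its whitespace tokenizer are hypothetical stand-ins for an index and an Analyzer. Each (term, document) pair is one posting, so summing the document frequency over all terms counts every document's distinct terms exactly once.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AverageUniqueTerms {

    /** Distinct lower-cased, whitespace-separated terms of one document. */
    static Set<String> distinctTerms(String doc) {
        Set<String> terms = new HashSet<>();
        for (String term : doc.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** Sum over all terms of the number of documents containing the term (#postings). */
    static long sumDocFreq(List<String> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            for (String term : distinctTerms(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        long sum = 0;
        for (int df : docFreq.values()) {
            sum += df;
        }
        return sum;
    }

    /** Average number of distinct terms per document: #postings / #documents. */
    static double averageUniqueTerms(List<String> docs) {
        return (double) sumDocFreq(docs) / docs.size();
    }

    public static void main(String[] args) {
        // Distinct terms per document: 3, 2 and 1.
        List<String> docs = Arrays.asList("a b c", "a a b", "c");
        System.out.println(sumDocFreq(docs));         // prints 6
        System.out.println(averageUniqueTerms(docs)); // prints 2.0
    }
}
```

The per-document counts 3 + 2 + 1 and the per-term document frequencies 2 + 2 + 2 both add up to 6 postings, which is why a single index-wide statistic is enough and no per-document loop is needed.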
Upvotes: 1