Reputation: 163
I am new to Lucene and don't have much working knowledge of it. I need to extract a document term vector, and I found the following code online ("How to extract Document Term Vector in Lucene 3.5.0"):
/**
 * Sums the term frequency vector of each document into a single term frequency map
 * @param indexReader the index reader, the document numbers are specific to this reader
 * @param docNumbers document numbers to retrieve frequency vectors from
 * @param fieldNames field names to retrieve frequency vectors from
 * @param stopWords terms to ignore
 * @return a map of each term to its frequency
 * @throws IOException
 */
private Map<String,Integer> getTermFrequencyMap(IndexReader indexReader, List<Integer> docNumbers, String[] fieldNames, Set<String> stopWords)
        throws IOException {
    Map<String,Integer> totalTfv = new HashMap<String,Integer>(1024);
    for (Integer docNum : docNumbers) {
        for (String fieldName : fieldNames) {
            TermFreqVector tfv = indexReader.getTermFreqVector(docNum, fieldName);
            if (tfv == null) {
                // ignore empty fields
                continue;
            }
            String terms[] = tfv.getTerms();
            int termCount = terms.length;
            int freqs[] = tfv.getTermFrequencies();
            for (int t=0; t < termCount; t++) {
                String term = terms[t];
                int freq = freqs[t];
                // filter out single-letter words and stop words
                if (StringUtils.length(term) < 2 ||
                        stopWords.contains(term)) {
                    continue; // stop
                }
                Integer totalFreq = totalTfv.get(term);
                totalFreq = (totalFreq == null) ? freq : freq + totalFreq;
                totalTfv.put(term, totalFreq);
            }
        }
    }
    return totalTfv;
}
I have created the index, which resides in the following directory:
String indexDir = "C:\\Lucene\\Output\\";
Directory dir = FSDirectory.open(new File(indexDir));
IndexReader reader = IndexReader.open(dir);
My problem is that I do not know how to get the doc ids (the List<Integer> docNumbers parameter) required by the function above. I have tried a couple of approaches, such as
TermDocs docs = reader.termDocs();
but it did not work.
Upvotes: 0
Views: 3383
Reputation: 12412
Lucene assigns document ids starting from zero, and maxDoc() returns one greater than the largest id, so you can simply loop over that range to get all ids, skipping deleted documents (Lucene only marks them as deleted when you call deleteDocument):
for (int docNum=0; docNum < reader.maxDoc(); docNum++) {
    if (reader.isDeleted(docNum)) {
        continue;
    }
    TermFreqVector tfv = reader.getTermFreqVector(docNum, "fieldName");
    ...
}
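For this to work, you have to enable term vectors for the field during indexing (for example by passing Field.TermVector.YES to the Field constructor); see Field.TermVector. Without them, getTermFreqVector returns null for that field.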
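To build the List<Integer> docNumbers that the method in the question expects, the same loop can collect the live ids into a list. Below is a minimal sketch, assuming Java 8 or later. The class name LiveDocIds and the method collect are hypothetical, not part of Lucene, and the deletion check is passed in as a predicate so that the example runs without Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper for illustration; not part of Lucene.
final class LiveDocIds {

    // Returns every id in [0, maxDoc) for which isDeleted reports false.
    static List<Integer> collect(int maxDoc, IntPredicate isDeleted) {
        List<Integer> docNumbers = new ArrayList<Integer>();
        for (int docNum = 0; docNum < maxDoc; docNum++) {
            if (isDeleted.test(docNum)) {
                continue; // skip documents marked as deleted
            }
            docNumbers.add(docNum);
        }
        return docNumbers;
    }

    public static void main(String[] args) {
        // Stand-in for an index with five slots where document 2 is deleted.
        System.out.println(collect(5, docNum -> docNum == 2)); // prints [0, 1, 3, 4]
    }
}
```

With the Lucene 3.x reader from the question, the call would presumably look like LiveDocIds.collect(reader.maxDoc(), reader::isDeleted), and the resulting list can be passed as docNumbers to getTermFrequencyMap.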
Upvotes: 2