How to get Document ids for Document Term Vector in Lucene

Question

I am new to Lucene and don't have much working knowledge of it. I need to extract document term vectors, and I found the following code online (from "How to extract Document Term Vector in Lucene 3.5.0"):

/**
 * Sums the term frequency vector of each document into a single term frequency map
 * @param indexReader the index reader, the document numbers are specific to this reader
 * @param docNumbers document numbers to retrieve frequency vectors from
 * @param fieldNames field names to retrieve frequency vectors from
 * @param stopWords terms to ignore
 * @return a map of each term to its frequency
 * @throws IOException
 */
private Map<String, Integer> getTermFrequencyMap(IndexReader indexReader, List<Integer> docNumbers, String[] fieldNames, Set<String> stopWords)
throws IOException {
    Map<String, Integer> totalTfv = new HashMap<String, Integer>(1024);

    for (Integer docNum : docNumbers) {
        for (String fieldName : fieldNames) {
            TermFreqVector tfv = indexReader.getTermFreqVector(docNum, fieldName);
            if (tfv == null) {
                // ignore empty fields
                continue;
            }

            String terms[] = tfv.getTerms();
            int termCount = terms.length;
            int freqs[] = tfv.getTermFrequencies();

            for (int t=0; t < termCount; t++) {
                String term = terms[t];
                int freq = freqs[t];

                // filter out single-letter words and stop words
                if (StringUtils.length(term) < 2 ||
                    stopWords.contains(term)) {
                    continue; // stop
                }

                Integer totalFreq = totalTfv.get(term);
                totalFreq = (totalFreq == null) ? freq : freq + totalFreq;
                totalTfv.put(term, totalFreq);
            }
        }
    }

    return totalTfv;
}

I have created the index, which resides in the following directory:

String indexDir = "C:\\Lucene\\Output\\";
Directory dir = FSDirectory.open(new File(indexDir));
IndexReader reader = IndexReader.open(dir);

My problem is that I do not know how to get the document ids (the docNumbers list) that the function above requires. I have tried a couple of approaches, such as

TermDocs docs = reader.termDocs();

but it did not work.

milan · Accepted Answer

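Lucene assigns document ids sequentially starting from zero, and maxDoc() is the exclusive upper bound, so you can simply loop over that range to get all ids. Skip deleted documents along the way (Lucene only marks them as deleted when you call deleteDocument, so their ids can still fall inside the range):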

for (int docNum=0; docNum < reader.maxDoc(); docNum++) {
    if (reader.isDeleted(docNum)) {
        continue;
    }
    TermFreqVector tfv = reader.getTermFreqVector(docNum, "fieldName");
    ...
}

For this to work, term vectors must be enabled for the field at indexing time (see Field.TermVector); otherwise getTermFreqVector returns null.
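
To build the list of document numbers that getTermFrequencyMap expects, the same loop can collect ids instead of reading the vectors directly. The following is only a minimal sketch: DocSource and LiveDocIds are hypothetical names introduced here so the example runs without Lucene on the classpath. DocSource simply mirrors the maxDoc() and isDeleted(int) methods used in the loop above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the two reader methods used by the loop above. */
interface DocSource {
    int maxDoc();
    boolean isDeleted(int docNum);
}

class LiveDocIds {
    /** Collects every document number in [0, maxDoc) that is not marked as deleted. */
    static List<Integer> collect(DocSource source) {
        List<Integer> docNumbers = new ArrayList<Integer>();
        for (int docNum = 0; docNum < source.maxDoc(); docNum++) {
            if (!source.isDeleted(docNum)) {
                docNumbers.add(docNum);
            }
        }
        return docNumbers;
    }

    public static void main(String[] args) {
        // Fake index with five slots in which documents 1 and 4 are deleted.
        final boolean[] deleted = {false, true, false, false, true};
        DocSource fake = new DocSource() {
            public int maxDoc() { return deleted.length; }
            public boolean isDeleted(int docNum) { return deleted[docNum]; }
        };
        System.out.println(collect(fake)); // prints [0, 2, 3]
    }
}
```

With a real Lucene 3.x index, the assumption is that you would drop the stand-in and call reader.maxDoc() and reader.isDeleted(docNum) directly inside the loop, then pass the resulting list as docNumbers.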
