How to keep Lucene index without deleted documents

Question

This is my first question on Stack Overflow, so wish me luck.

I am running a classification process over a Lucene index in Java, and I need to update a document field named category. I have been using Lucene 4.2 with IndexWriter's updateDocument() method for that purpose, and it works very well except for the deletion part. Even if I call forceMergeDeletes() after the updates, the index still shows some already-deleted documents. For example, if I run the classification over an index with 1000 documents, the final number of documents in the index stays the same and everything works as expected, but when I increase the index to 10000 documents, the index still shows some (though not all) of the deleted documents. So, how can I actually erase those deleted documents from the index?

Here are some snippets of my code:

public static void main(String[] args) throws IOException, ParseException {
    ///////////////////////Preparing config data////////////////////////////
    File indexDir = new File("/indexDir");
    Directory fsDir = FSDirectory.open(indexDir);

    IndexWriterConfig iwConf = new IndexWriterConfig(Version.LUCENE_42, new WhitespaceSpanishAnalyzer());
    iwConf.setOpenMode(IndexWriterConfig.OpenMode.APPEND);
    IndexWriter indexWriter = new IndexWriter(fsDir, iwConf);

    IndexReader reader = DirectoryReader.open(fsDir);
    IndexSearcher indexSearcher = new IndexSearcher(reader);
    KNearestNeighborClassifier classifier = new KNearestNeighborClassifier(100);
    AtomicReader ar = new SlowCompositeReaderWrapper((CompositeReader) reader);

    classifier.train(ar, "text", "category", new WhitespaceSpanishAnalyzer());

    System.out.println("***Before***");
    showIndexedDocuments(reader);
    System.out.println("***Before***");

    int maxdoc = reader.maxDoc();
    int j = 0;
    for (int i = 0; i < maxdoc; i++) {
        Document doc = reader.document(i);
        String clusterClasif = doc.get("category");
        String text = doc.get("text");
        String docid = doc.get("doc_id");
        ClassificationResult result = classifier.assignClass(text);
        String classified = result.getAssignedClass().utf8ToString();

        if (!classified.isEmpty() && clusterClasif.compareTo(classified) != 0) {
            Term term = new Term("doc_id", docid);
            doc.removeField("category");
            doc.add(new StringField("category",
                    classified, Field.Store.YES));
            indexWriter.updateDocument(term,doc);
            j++;
        }
    }
    indexWriter.forceMergeDeletes(true);
    indexWriter.close();
    System.out.println("Classified documents count: " + j);        
    System.out.println();
    reader.close();

    reader = DirectoryReader.open(fsDir);
    System.out.println("Deleted docs: " + reader.numDeletedDocs());
    System.out.println("***After***");
    showIndexedDocuments(reader);
}

private static void showIndexedDocuments(IndexReader reader) throws IOException {
    int maxdoc = reader.maxDoc();
    for (int i = 0; i < maxdoc; i++) {
        Document doc = reader.document(i);
        String idDoc = doc.get("doc_id");
        String text = doc.get("text");
        String category = doc.get("category");

        System.out.println("Id Doc: " + idDoc);
        System.out.println("Category: " + category);
        System.out.println("Text: " + text);
        System.out.println();
    }
    System.out.println("Total: " + maxdoc);
}

I have spent many hours looking for a solution to this. Some people say that the deleted documents in the index are not important and that they will eventually be erased as we keep adding documents to the index, but I need to control that process so that I can iterate over the index documents at any time and be sure that the documents I retrieve are actually the live ones. Lucene versions before 4.0 had a method in the IndexReader class named isDeleted(docId) that tells whether a document has been marked as deleted. That could be half of the solution to my problem, but I have not found a way to do that with Lucene 4.2. If you know how to do that, I would really appreciate it if you shared it.

femtoRgon · Accepted Answer

You can check whether a document is deleted using the MultiFields class, like:

Bits liveDocs = MultiFields.getLiveDocs(reader); // null if the reader has no deletions
if (liveDocs != null && !liveDocs.get(docID)) ...

So, working this into your code, perhaps something like:

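int maxdoc = reader.maxDoc();
// getLiveDocs returns null when the reader has no deletions
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < maxdoc; i++) {
    // skip documents that have been marked as deleted
    if (liveDocs != null && !liveDocs.get(i)) continue;
    Document doc = reader.document(i);
    String idDoc = doc.get("doc_id");
    ....
}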

By the way, it sounds like you have previously been working with 3.X and are now on 4.X. The Lucene Migration Guide is very helpful for understanding these sorts of changes between versions, and how to resolve them.
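
To see why maxDoc() keeps counting documents that updateDocument() has replaced, here is a toy model of the bookkeeping in plain Java. It is not Lucene code: the class LiveDocsSketch and all of its methods are made up for illustration, and a java.util.BitSet stands in for the Bits object returned by getLiveDocs. The only assumption it encodes is that an update marks the old slot as deleted and appends a new one, so the slot count grows while the live count stays the same.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model only (hypothetical class, not part of Lucene). */
public class LiveDocsSketch {
    private final List<String> ids = new ArrayList<>(); // one slot per document ever added
    private final BitSet live = new BitSet();           // set bit = slot holds a live document

    /** Appends a document and marks its slot as live. */
    public void add(String id) {
        live.set(ids.size());
        ids.add(id);
    }

    /** Mimics update semantics: mark matching live slots deleted, then append a new slot. */
    public void update(String id) {
        for (int i = 0; i < ids.size(); i++) {
            if (live.get(i) && ids.get(i).equals(id)) {
                live.clear(i);
            }
        }
        add(id);
    }

    /** Counts every slot, deleted or not. */
    public int maxDoc() { return ids.size(); }

    /** Counts live slots only. */
    public int numDocs() { return live.cardinality(); }

    public int numDeletedDocs() { return maxDoc() - numDocs(); }

    /** Iterates over all slots and skips the deleted ones, like the loop above. */
    public List<String> liveIds() {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < maxDoc(); i++) {
            if (!live.get(i)) continue;
            out.add(ids.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        LiveDocsSketch index = new LiveDocsSketch();
        index.add("a");
        index.add("b");
        index.add("c");
        index.update("b"); // old "b" slot is marked deleted, new "b" slot is appended
        System.out.println("maxDoc=" + index.maxDoc()
                + " numDocs=" + index.numDocs()
                + " deleted=" + index.numDeletedDocs());
        System.out.println("live ids: " + index.liveIds());
    }
}
```

In this model, three adds followed by one update leave four slots but only three live documents. The same distinction applies to a real reader: maxDoc() includes deleted documents, numDocs() does not, so a loop bounded by maxDoc() needs the liveDocs check to return only live documents.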
