Lucene: documents are not removed from index if term contains alphanumeric value

Question

I have an existing index; adding new documents and searching works fine. However, updating and deleting existing documents does not work if the deletion term has an alphanumeric value (ABC123 or ABC); with purely numeric values everything works. I'm using Lucene 8.11.2 with Java 8 and StandardAnalyzer. Below is my simplified code:

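import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import lombok.Getter;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MyDirectory {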
  
  @Getter
  private Directory index;
  @Getter
  private IndexWriter writer;

  public MyDirectory(String indexPath) throws IOException {
    index = FSDirectory.open(Paths.get(indexPath));
  }

  public void addNewDocument() {
    try {
      openWriter();

      Document doc = new Document();
      doc.add(new TextField("ID", "ABC123", Field.Store.YES));
      getWriter().addDocument(doc);

      closeWriter();
    } catch (Exception e) {
    } 
  }

  public void updateDocument() {
    try {
      openWriter();

      Term delTerm = new Term("ID", "ABC123");
      List<Document> docs = new ArrayList<>();
      Document doc = new Document();
      doc.add(new TextField("ID", "ABC123", Field.Store.YES));
      doc.add(new TextField("NAME", "test", Field.Store.YES));
      docs.add(doc);

      // Adds a second document with ID ABC123 and name 'test' to the index.
      // I expect the old document with ID ABC123 to be removed here.
      // If the ID is 123 (digits only), it works.
      getWriter().updateDocuments(delTerm, docs);
      closeWriter();
    } catch (Exception e) {
    }
  }

  private void openWriter() throws IOException {
    writer = new IndexWriter(getIndex(), new IndexWriterConfig(getPerFieldAnalyzer()));
  }


  private PerFieldAnalyzerWrapper getPerFieldAnalyzer() {
    return new PerFieldAnalyzerWrapper(new StandardAnalyzer());
  }

  private void closeWriter() {
    try {
      getWriter().close();

    } catch (IOException e) {
    }
  }
}

Do I need to use a different analyzer for that field?

Teodor Mysko · Accepted Answer

After some investigation, I figured out that Term does not analyze its input text: the term is matched against the index exactly as given. The ID field was added to the document with TextField, so StandardAnalyzer tokenized and lowercased it at index time. The index therefore holds abc123 while the deletion term was ABC123, so nothing matched and nothing was deleted. Purely numeric IDs are unchanged by lowercasing, which is why they worked. So I changed TextField to StringField, which indexes the value verbatim as a single token, and update/delete then worked as expected. However, in this case regular search by ID does not work, so I ended up having two ID fields in the index: one tokenized for external search and another one that is not tokenized for internal use.
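A minimal sketch of the two-field approach, under some assumptions: the field name ID_EXACT, the class and method names, and the in-memory ByteBuffersDirectory are all hypothetical choices for illustration and do not come from the original code. It assumes Lucene 8.x on the classpath. The same method is also called with the tokenized ID field as the deletion field, to show the behaviour described in the question.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class TwoIdFieldsSketch {

  // Hypothetical name for the non-tokenized copy of the ID (not from the question).
  static final String ID_EXACT = "ID_EXACT";

  // Builds a document carrying the ID twice: verbatim and analyzed.
  static Document buildDoc(String id, String name) {
    Document doc = new Document();
    // StringField: indexed as one unmodified token, used for update/delete.
    doc.add(new StringField(ID_EXACT, id, Field.Store.NO));
    // TextField: run through the analyzer, used for regular search.
    doc.add(new TextField("ID", id, Field.Store.YES));
    doc.add(new TextField("NAME", name, Field.Store.YES));
    return doc;
  }

  // Adds a document, then updates it using a Term on deleteField.
  // Returns the number of live documents left in the index.
  static int addThenUpdate(String deleteField, String id) throws IOException {
    try (Directory dir = new ByteBuffersDirectory();
         IndexWriter writer =
             new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      writer.addDocument(buildDoc(id, "original"));
      // The Term text is NOT analyzed; it must equal the indexed token exactly.
      writer.updateDocument(new Term(deleteField, id), buildDoc(id, "test"));
      writer.commit();
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        return reader.numDocs();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(addThenUpdate(ID_EXACT, "ABC123")); // exact field
    System.out.println(addThenUpdate("ID", "ABC123"));     // tokenized field, mixed case
    System.out.println(addThenUpdate("ID", "123"));        // tokenized field, digits only
  }
}
```

With the exact field the old document is replaced, so one document remains. With the tokenized ID field and ABC123 the old document survives and two documents remain, because the index holds abc123. With 123 the term happens to equal the indexed token, so the update works and one document remains.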

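Another solution for the update is to delete by Query with the deleteDocuments() method and then add the new documents. The QueryParser analyzes the query text with the same analyzer used at index time, so it matches the tokenized field: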

BooleanQuery.Builder querybuilder = new BooleanQuery.Builder();
QueryParser queryParser = new QueryParser("ID", getPerFieldAnalyzer());
querybuilder.add(queryParser.parse("ABC123"), BooleanClause.Occur.FILTER);
getWriter().deleteDocuments(querybuilder.build());

List<Document> docs = new ArrayList<>();
Document doc = new Document();
doc.add(new TextField("ID", "ABC123", Field.Store.YES));
doc.add(new TextField("NAME", "test", Field.Store.YES));
docs.add(doc);
getWriter().addDocuments(docs);
