WilhelmM

Reputation: 11

Understanding Segment Merges in Lucene

We have the following scenario:

  1. Elasticsearch is built on Lucene.
  2. Index baseline of 14 million documents (batch indexed).
  3. Each week about 20,000 documents get deleted and about 30,000 documents get reindexed or updated. Indexing happens in batches of 2,000 documents via the Bulk API.

We handle the deletions first and run the updates afterwards. FYI, it can happen that we delete a document and the updater indexes it again a few minutes later.

My question now: if ES marks a document (ID:D123) as deleted in a segment (let's say A), and afterwards a document with the same ID (ID:D123) gets indexed into another segment (B), the document should be searchable. But what happens when a segment merge occurs?

Segment B will be merged into segment A, which contains the delete flag for the same document ID (ID:D123).

After the merge, does the document still have the delete flag? I know that deleted documents are not carried over when a segment gets merged. But does it matter which way around the merge happens: segment A into B, or B into A?

We lose some documents with this scenario and still cannot find out why.

As a short-term solution, I filter out the documents to be deleted after reindexing.

I'd like to understand the whole process. It doesn't seem consistent at all!

Thanks

Upvotes: 1

Views: 3257

Answers (1)

eribeiro

Reputation: 592

Lucene's segment merging creates a new segment from the content of previous segments, leaving out deleted or outdated documents. So, using your example, a new segment C is created with the content of segments A and B, in that order, and the deleted documents are filtered out of the new segment. The delete flag belongs to the specific copy of the document stored in segment A, not to the ID in general, so the copy indexed into B stays live and is carried over into C. Also, each commit creates a new segment, and segments have generations (1, 2, ...). Each segment is therefore a snapshot of the time interval between commits, and it doesn't make sense to read B first and then A during a merge: inserts and deletes of the same document are not commutative, so we would be going "backwards" in time. In other words, you effectively updated document ID:D123 by deleting it and inserting a new document with the same ID. There is no real in-place update in Lucene's indexes: an update is a delete followed by an insert.
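To make the "not commutative" point concrete, here is a minimal toy model in Python. It is only a sketch and not Lucene's actual code: `Segment`, `delete_by_id`, `merge` and `replay` are hypothetical names, and the model assumes one new segment per index operation and that a delete only flags copies that already exist when it is applied.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy stand-in for a segment: stored docs plus per-segment delete marks."""
    docs: list                                  # (doc_id, payload); list index = internal doc number
    deleted: set = field(default_factory=set)   # internal doc numbers flagged as deleted

    def live_docs(self):
        return [doc for i, doc in enumerate(self.docs) if i not in self.deleted]


def delete_by_id(segments, doc_id):
    """Flag every copy of doc_id that exists right now; later copies are unaffected."""
    for seg in segments:
        for i, (existing_id, _) in enumerate(seg.docs):
            if existing_id == doc_id:
                seg.deleted.add(i)


def merge(segments):
    """Build the merged segment from live docs only, in segment order."""
    merged = []
    for seg in segments:
        merged.extend(seg.live_docs())
    return Segment(merged)


def replay(ops):
    """Apply ops in order (one new segment per index op), then merge everything."""
    segments = []
    for op in ops:
        if op[0] == "index":
            _, doc_id, payload = op
            delete_by_id(segments, doc_id)      # an update is a delete followed by an insert
            segments.append(Segment([(doc_id, payload)]))
        else:
            delete_by_id(segments, op[1])
    return merge(segments).live_docs()


delete_then_index = replay([("index", "D123", "old"), ("delete", "D123"), ("index", "D123", "new")])
index_then_delete = replay([("index", "D123", "old"), ("index", "D123", "new"), ("delete", "D123")])
print(delete_then_index)   # [('D123', 'new')]
print(index_then_delete)   # []
```

In this model the document survives the merge when the delete is applied before the re-index, and disappears when the delete is applied after it. If documents go missing in your setup, it may be worth checking whether some deletes are actually processed after the corresponding re-index.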

Upvotes: 0
