Reputation: 21
I know that Lucene keeps a flag for each document and unsets it when the document is deleted. I need to know more details about this, because I want to use it on a lot of indexes and performance is important to me.
How does Lucene find the flag? How much disk utilization does a deletion cause?
Upvotes: 1
Views: 73
Reputation: 8657
Lucene flags deleted documents in a file with the extension .del. The file format is: Format, Header, ByteCount, BitCount, Bits | DGaps (depending on Format)
Format is 1: indicates cleared DGaps.
ByteCount indicates the number of bytes in Bits. It is typically (SegSize/8)+1.
BitCount indicates the number of bits that are currently set in Bits.
Bits contains one bit for each document indexed. When the bit corresponding to a document number is cleared, that document is marked as deleted. Bit ordering is from least to most significant. Thus, if Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as alive (not deleted).
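To make the bit layout concrete, here is a minimal sketch of the lookup described above. It is not Lucene's own code; the class and method names (LiveDocsSketch, isLive, delete) are hypothetical, and it only assumes the layout stated here: one bit per document, least significant bit first, a set bit meaning "alive".

```java
final class LiveDocsSketch {
    /** Returns true when the bit for docId is set, i.e. the document is alive. */
    static boolean isLive(byte[] bits, int docId) {
        // docId >> 3 selects the byte, docId & 7 selects the bit inside it.
        return (bits[docId >> 3] & (1 << (docId & 7))) != 0;
    }

    /** Clears the bit for docId (marks it deleted); returns true if it was alive before. */
    static boolean delete(byte[] bits, int docId) {
        boolean wasLive = isLive(bits, docId);
        bits[docId >> 3] &= (byte) ~(1 << (docId & 7));
        return wasLive;
    }

    public static void main(String[] args) {
        byte[] bits = {0x00, 0x02};          // the two-byte example from the text
        System.out.println(isLive(bits, 9)); // prints true
        delete(bits, 9);
        System.out.println(isLive(bits, 9)); // prints false
    }
}
```

Under this layout, finding the flag is a single array index plus a mask, and the storage cost is roughly one bit per document in the segment, however many documents are deleted.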
DGaps represents sparse bit-vectors more efficiently than Bits. It is made of DGaps on indexes of nonOnes bytes in Bits, and the nonOnes bytes themselves. The number of nonOnes bytes in Bits (NonOnesBytesCount) is not stored.
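For example, if there are 8000 bits and only bits 10, 12 and 32 are cleared, DGaps would be used:
(VInt) 1, (byte) 0xEB, (VInt) 3, (byte) 0xFE
Here byte 1 is 0xEB (bits 10 and 12 cleared) and byte 4 is 0xFE (bit 32 cleared). The Lucene documentation shows 20 and 1 for these two bytes, which are the values with the same bits set instead of cleared.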
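The gap-then-byte layout can be sketched as follows. This is an illustration only, assuming the description above (a VInt gap to the next byte that is not 0xFF, followed by that byte); it leaves out the Format, Header, ByteCount and BitCount fields, and the names ClearedDGapsSketch, writeVInt and encode are hypothetical rather than Lucene's API.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class ClearedDGapsSketch {
    /** Writes a non-negative int as a VInt: 7 bits per byte, high bit set when more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Emits one (gap, byte) pair for every byte of bits that is not 0xFF. */
    static byte[] encode(byte[] bits) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0; // index of the previous non-ones byte
        for (int i = 0; i < bits.length; i++) {
            if (bits[i] != (byte) 0xFF) {
                writeVInt(out, i - last);
                out.write(bits[i]);
                last = i;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bits = new byte[1000];    // 8000 documents, all alive
        Arrays.fill(bits, (byte) 0xFF);
        bits[3 >> 3] &= ~(1 << (3 & 7));   // delete document 3
        bits[33 >> 3] &= ~(1 << (33 & 7)); // delete document 33
        System.out.println(encode(bits).length); // prints 4
    }
}
```

With only a few deletions, the payload shrinks from 1000 bytes to a handful, which is why the sparse form is preferred when almost every bit is still set.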
Lucene's I/O cost also depends on how the IndexWriter is opened and how many IndexWriter objects exist. If you want to delete or index a number of documents, you can do all of it through one writer and pay for it in a single I/O hit, by calling writer.commit() or, once you have finished your job, writer.close().
What I am trying to say is that creating and initializing an IndexWriter is costly, and Lucene recommends using a single IndexWriter object.
Here you can find everything about Lucene.
Upvotes: 1