Lucene: Loading Index files while searching?

Question

Can anyone explain how index files are loaded into memory while searching?

Is a whole file (.fnm, .tis, .fdt, etc.) loaded at once, or in chunks?

How are individual segments loaded, and in which order?

How can a Lucene index be encrypted?

Artur Nowak · Accepted Answer

The main reason for splitting the index into segments is that you can rarely load the whole index into memory.

The most important constraint taken into account when designing the index format is that disk seek time is relatively long (on platter-based hard drives, which are still the most widely used). A good estimate is that the transfer time per byte is about 0.01 to 0.02 μs, while the average seek time of the disk head is about 5 ms!
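To make that gap concrete, here is a small back-of-the-envelope sketch using only the two estimates above. The function names and the example workload (1 MB read as one run versus 1,000 scattered pieces) are made up for illustration.

```python
# Estimates from the paragraph above, expressed in nanoseconds.
SEEK_NS = 5_000_000           # about 5 ms average seek
TRANSFER_NS_PER_BYTE_FAST = 10  # about 0.01 microseconds per byte
TRANSFER_NS_PER_BYTE_SLOW = 20  # about 0.02 microseconds per byte


def bytes_per_seek(seek_ns, transfer_ns_per_byte):
    """How many bytes could be streamed in the time one seek takes."""
    return seek_ns // transfer_ns_per_byte


def read_cost_ns(num_bytes, num_seeks, transfer_ns_per_byte):
    """Rough cost model: every seek is paid in full, plus sequential transfer."""
    return num_seeks * SEEK_NS + num_bytes * transfer_ns_per_byte


# One seek costs as much as streaming a few hundred kilobytes.
fast = bytes_per_seek(SEEK_NS, TRANSFER_NS_PER_BYTE_FAST)  # 500000 bytes
slow = bytes_per_seek(SEEK_NS, TRANSFER_NS_PER_BYTE_SLOW)  # 250000 bytes

# Hypothetical workload: 1 MB read contiguously vs. in 1,000 scattered pieces.
contiguous = read_cost_ns(1_000_000, 1, TRANSFER_NS_PER_BYTE_SLOW)
scattered = read_cost_ns(1_000_000, 1_000, TRANSFER_NS_PER_BYTE_SLOW)
```

Under this model a single seek is worth roughly 250 to 500 KB of sequential reading, which is why the format tries to keep related data contiguous on disk.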

So typically only the dictionary is kept in memory; it is used to find the starting block of a postings list on disk [*]. The other parts are loaded only on demand and then purged from memory to make room for other searches.

As for encryption, it depends on whether you need to keep the index encrypted all the time (even when in memory) or whether it suffices to encrypt only the index files. For the latter, I think an encrypted file system will be enough. The former is certainly possible too, since the index is already transformed by various compression techniques. However, I don't think it's widely used, as the first and foremost requirement for a full-text engine is speed.

[*] It's not really that simple: since we perform binary searches against the dictionary, we need to ensure that all entries in the first structure have equal length. That is clearly not the case for ordinary dictionary words, and padding would be too costly (think of the word lengths of some chemical substances), so we actually maintain two levels of dictionary. The first one (which needs to fit in memory and is stored in .tii files) keeps a sorted list of the starting positions of terms in the second index (.tis files). The second index is a concatenated array of all terms in increasing order, each along with a pointer to the sector in the .frq file. The second index often fits in memory and is loaded at startup, but that can be impossible, e.g. for bigram indexes. Also note that for some time now Lucene has by default not used individual files but so-called compound files (with the .cfs extension) to cut down the number of open files.
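The two-level idea in the footnote can be sketched roughly as follows. This is a toy in-memory model of the description above, not Lucene's actual .tii/.tis format: the entry layout (2-byte length, term bytes, 8-byte pointer) and the names build, read_entry, and lookup are assumptions made up for the example.

```python
import struct


def build(term_to_pointer):
    """Build (offsets, blob).

    offsets: fixed-width first level, one start position per term.
    blob: second level, variable-length entries concatenated in sorted order.
    """
    offsets = []
    blob = bytearray()
    for term in sorted(term_to_pointer):
        offsets.append(len(blob))
        data = term.encode("utf-8")
        # Hypothetical entry layout: length, term bytes, postings pointer.
        blob += struct.pack(">H", len(data))
        blob += data
        blob += struct.pack(">Q", term_to_pointer[term])
    return offsets, bytes(blob)


def read_entry(blob, offset):
    """Decode the (term, pointer) entry starting at the given offset."""
    (length,) = struct.unpack_from(">H", blob, offset)
    start = offset + 2
    term = blob[start:start + length].decode("utf-8")
    (pointer,) = struct.unpack_from(">Q", blob, start + length)
    return term, pointer


def lookup(offsets, blob, term):
    """Binary search over the equal-length offsets, following each into the blob."""
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate, pointer = read_entry(blob, offsets[mid])
        if candidate == term:
            return pointer
        if candidate < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


# Terms of very different lengths, each mapped to a made-up postings pointer.
offsets, blob = build({"apple": 0, "methylcyclohexane": 128, "zebra": 512})
pointer = lookup(offsets, blob, "methylcyclohexane")  # 128
missing = lookup(offsets, blob, "banana")             # None
```

The binary search only ever indexes into the fixed-width offsets list, so the terms themselves need no padding, however long they are.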
