Filip Kowalski

Reputation: 154

Inode exhaustion when building a Lucene index

Preconditions:

Explanation of how the index is built

What I do: I have 7.5 mln messages on the topic "articles", for multiple tenants (in my case 67 tenants). Those messages need to be read and stored on disk as a Lucene index. Each tenant gets its own directory/index.

This topic contains articles and updates to articles. Articles are kept in a parent-child relationship (blocks, bitsets). Each block follows the standard layout: the children are added first, then the parent as the last element, and the whole block is added to the index by calling indexWriter.addDocuments(block) (sketched below, after the listener). Commits are done in the background by a scheduler, though. Basically, reading from Kafka and saving to the index looks like this:

@KafkaListener(topics = "articles")
public void importArticle(@Header("tenant") final String tenant, final Article article) {
    // look up an existing parent document for this article in the tenant's index
    Optional<DocumentWithLuceneId> optCurrentIndexItem = indexItemProvider.findParentDocument(article, tenant);

    if (optCurrentIndexItem.isEmpty()) {
        // first time we see this article: add a new parent-child block
        indexItemCreator.create(article, tenant);
    } else {
        // article already indexed: replace the existing block
        DocumentWithLuceneId currentIndexItem = optCurrentIndexItem.get();
        indexItemUpdater.update(currentIndexItem, article, tenant);
    }
}
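
For clarity, indexItemCreator.create essentially builds the parent-child block (children first, parent last) and adds it to the tenant's index in one call. A simplified sketch — the field names, the ArticleUpdate type and the indexResolver/INDEX_NAME references are illustrative, not my real schema:

// Simplified sketch of the create path; field names and ArticleUpdate are illustrative.
final IndexWriter indexWriter = indexResolver.resolveIndex(tenant, INDEX_NAME);

final List<Document> block = new ArrayList<>();
for (final ArticleUpdate update : article.getUpdates()) {           // children first
    final Document child = new Document();
    child.add(new StringField("type", "update", Field.Store.YES));
    child.add(new TextField("body", update.getBody(), Field.Store.NO));
    block.add(child);
}

final Document parent = new Document();                              // parent is the last document
parent.add(new StringField("type", "article", Field.Store.YES));
parent.add(new StringField("articleId", article.getId(), Field.Store.YES));
block.add(parent);

indexWriter.addDocuments(block);                                     // no commit here, the scheduler commits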

Where "update" is:

Instances of IndexWriter for each tenant are kept in memory:

private final Map<String, IndexWriter> indexes = new HashMap<>();

public IndexWriter resolveIndex(final String tenant, final String indexName) {
    synchronized (indexes) {
        if (indexes.containsKey(tenant + indexName)) {
            return indexes.get(tenant + indexName);
        }

        try {
            final IndexWriter writer = new IndexWriter(directory(tenant), config());
            indexes.put(tenant + indexName, writer);
            return writer;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

private IndexWriterConfig config() {
    TieredMergePolicy mergePolicy = new TieredMergePolicy();
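    // always write merged segments in the compound file format (fewer files per segment)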
    mergePolicy.setNoCFSRatio(1.0);

    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
    indexWriterConfig.setMergePolicy(mergePolicy);
    return indexWriterConfig;
}
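
The directory(tenant) helper used in resolveIndex is not shown above; it is essentially a per-tenant FSDirectory. A minimal sketch, assuming a configurable base path (the path layout here is illustrative):

// Minimal sketch of directory(tenant); the base path and layout are illustrative.
private Directory directory(final String tenant) throws IOException {
    final Path indexPath = Paths.get(indexBasePath, tenant);   // e.g. <base>/<tenant>
    Files.createDirectories(indexPath);
    return FSDirectory.open(indexPath);
}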

commit() operations are done as a scheduled job, per tenant, in parallel, every 20 seconds (only for index writers that have pending changes). Locking per tenant (ReentrantLock) is in place, roughly as sketched below.
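
The commit job looks roughly like this (a simplified sketch; the "locks" map and the method name are illustrative, and error handling/metrics are omitted):

// Simplified sketch of the per-tenant commit job; "locks" is an illustrative
// ConcurrentHashMap<String, ReentrantLock>, keyed the same way as "indexes".
@Scheduled(fixedDelay = 20_000)
public void commitPendingChanges() {
    indexes.entrySet().parallelStream().forEach(entry -> {
        final IndexWriter writer = entry.getValue();
        if (!writer.hasUncommittedChanges()) {
            return;                                          // nothing pending for this writer
        }
        final ReentrantLock lock = locks.computeIfAbsent(entry.getKey(), key -> new ReentrantLock());
        lock.lock();
        try {
            writer.commit();                                 // flush pending segments to disk
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            lock.unlock();
        }
    });
}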


The problem

The problem I have is that, when importing articles from Kafka, inode usage reaches 100% (with disk space usage at 93%) after about 1 mln records, and at that point no new index files can be created.

I have played with different settings for TieredMergePolicy (e.g. setSegmentsPerTier(3), setMaxMergeAtOnce(100), setMaxMergedSegmentMB(2048)), to no avail.
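
One of those variants, roughly (the exact numbers differed between runs; this shows the shape of the change inside config(), not a verbatim copy of my code):

// One of the TieredMergePolicy variants I tried inside config(); values varied between runs.
final TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setNoCFSRatio(1.0);
mergePolicy.setSegmentsPerTier(3);
mergePolicy.setMaxMergeAtOnce(100);
mergePolicy.setMaxMergedSegmentMB(2048);
indexWriterConfig.setMergePolicy(mergePolicy);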

On a different machine, with more disk space, fewer tenants (about 10) and fewer articles to import (4.5 mln), the entire process works fine.

But I believe there is something I can do to improve the process so that it also works on the weaker machine. What can I do to deal with inode exhaustion during the index build-up? Any help is welcome.

Upvotes: 1

Views: 106

Answers (0)
