Filip Kowalski

Reputation: 154

Inode exhaustion when building a Lucene index

Preconditions:

Explanation of how the index is built

What I do: I have 7.5 mln messages on the topic "articles", for multiple tenants (in my case 67 tenants). Those messages need to be read and stored on disk as a Lucene index. Each tenant gets its own directory/index.

This topic contains articles and updates to articles. Articles are kept in a parent-child relationship (blocks, bitsets). Each block follows the standard layout: the children are added first, then the parent as the last element, and the whole block is added to the index by calling indexWriter.addDocuments(block) (sketched below, after the listener). Commits are done in the background by a scheduler, though. Basically, reading from Kafka and saving to the index looks like this:

@KafkaListener(topics = "articles")
public void importArticle(@Header("tenant") final String tenant, final Article article) {
    // look up an existing parent document for this article in the tenant's index
    Optional<DocumentWithLuceneId> optCurrentIndexItem = indexItemProvider.findParentDocument(article, tenant);

    if (optCurrentIndexItem.isEmpty()) {
        // first time we see this article: add a new parent-child block
        indexItemCreator.create(article, tenant);
    } else {
        // article already indexed: replace the existing block
        DocumentWithLuceneId currentIndexItem = optCurrentIndexItem.get();
        indexItemUpdater.update(currentIndexItem, article, tenant);
    }
}
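
For clarity, indexItemCreator.create essentially builds the parent-child block (children first, parent last) and adds it to the tenant's index in one call. A simplified sketch — the field names, the ArticleUpdate type and the indexResolver/INDEX_NAME references are illustrative, not my real schema:

// Simplified sketch of the create path; field names and ArticleUpdate are illustrative.
final IndexWriter indexWriter = indexResolver.resolveIndex(tenant, INDEX_NAME);

final List<Document> block = new ArrayList<>();
for (final ArticleUpdate update : article.getUpdates()) {           // children first
    final Document child = new Document();
    child.add(new StringField("type", "update", Field.Store.YES));
    child.add(new TextField("body", update.getBody(), Field.Store.NO));
    block.add(child);
}

final Document parent = new Document();                              // parent is the last document
parent.add(new StringField("type", "article", Field.Store.YES));
parent.add(new StringField("articleId", article.getId(), Field.Store.YES));
block.add(parent);

indexWriter.addDocuments(block);                                     // no commit here, the scheduler commits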

Where "update" is:

Instances of IndexWriter for each tenant are kept in memory:

private final Map<String, IndexWriter> indexes = new HashMap<>();

public IndexWriter resolveIndex(final String tenant, final String indexName) {
    synchronized (indexes) {
        if (indexes.containsKey(tenant + indexName)) {
            return indexes.get(tenant + indexName);
        }

        try {
            final IndexWriter writer = new IndexWriter(directory(tenant), config());
            indexes.put(tenant + indexName, writer);
            return writer;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

private IndexWriterConfig config() {
    TieredMergePolicy mergePolicy = new TieredMergePolicy();
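    // always write merged segments in the compound file format (fewer files per segment)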
    mergePolicy.setNoCFSRatio(1.0);

    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
    indexWriterConfig.setMergePolicy(mergePolicy);
    return indexWriterConfig;
}
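
The directory(tenant) helper used in resolveIndex is not shown above; it is essentially a per-tenant FSDirectory. A minimal sketch, assuming a configurable base path (the path layout here is illustrative):

// Minimal sketch of directory(tenant); the base path and layout are illustrative.
private Directory directory(final String tenant) throws IOException {
    final Path indexPath = Paths.get(indexBasePath, tenant);   // e.g. <base>/<tenant>
    Files.createDirectories(indexPath);
    return FSDirectory.open(indexPath);
}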

commit() operations are done as a scheduled job, per tenant, in parallel, every 20 seconds (only for index writers that have pending changes). Locking per tenant (ReentrantLock) is in place, roughly as sketched below.
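
The commit job looks roughly like this (a simplified sketch; the "locks" map and the method name are illustrative, and error handling/metrics are omitted):

// Simplified sketch of the per-tenant commit job; "locks" is an illustrative
// ConcurrentHashMap<String, ReentrantLock>, keyed the same way as "indexes".
@Scheduled(fixedDelay = 20_000)
public void commitPendingChanges() {
    indexes.entrySet().parallelStream().forEach(entry -> {
        final IndexWriter writer = entry.getValue();
        if (!writer.hasUncommittedChanges()) {
            return;                                          // nothing pending for this writer
        }
        final ReentrantLock lock = locks.computeIfAbsent(entry.getKey(), key -> new ReentrantLock());
        lock.lock();
        try {
            writer.commit();                                 // flush pending segments to disk
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            lock.unlock();
        }
    });
}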


The problem

The problem I have is that, when importing articles from Kafka, inode usage reaches 100% (with disk space usage at 93%) after about 1 mln records, and at that point no new index files can be created.

I have played with different settings for TieredMergePolicy (e.g. setSegmentsPerTier(3), setMaxMergeAtOnce(100), setMaxMergedSegmentMB(2048)), to no avail.
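
One of those variants, roughly (the exact numbers differed between runs; this shows the shape of the change inside config(), not a verbatim copy of my code):

// One of the TieredMergePolicy variants I tried inside config(); values varied between runs.
final TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setNoCFSRatio(1.0);
mergePolicy.setSegmentsPerTier(3);
mergePolicy.setMaxMergeAtOnce(100);
mergePolicy.setMaxMergedSegmentMB(2048);
indexWriterConfig.setMergePolicy(mergePolicy);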

On a different machine, with more disk space, fewer tenants (about 10) and fewer articles to import (4.5 mln), the entire process works fine.

But I believe there is something I can do to improve the process so that it also works on the weaker machine. What can I do to deal with inode exhaustion during the index build-up? Any help is welcome.

Upvotes: 1

Views: 106

Answers (0)
