Stack and versions:
3.1.9
17
spring-kafka
9.9.2
8.11.3
3.6.2

Disk:
19G (47%) space available on disk
2.3M available inodes left (out of 2.4M)

I have 7.5 mln messages on the topic "articles" for multiple tenants (in my case there are 67 tenants). Those messages need to be read and stored on disk as a Lucene index. Each tenant gets its own directory/index.
This topic contains articles and updates to articles. Articles are kept in a parent-child relationship (blocks, bitsets). Each block follows the standard layout: the children are added first, then the parent as the last element, and the block is added to the index by calling indexWriter.addDocuments(block) (a sketch of this layout follows the listener below). Commits are done in the background by a scheduler. Basically, reading from Kafka and saving to the index looks like this:
@KafkaListener(topics = "articles")
// "import" is a reserved word in Java, so the listener method is named importArticle
public void importArticle(@Header("tenant") final String tenant, final Article article) {
    // Look up the parent document for this article in the tenant's index
    Optional<DocumentWithLuceneId> optCurrentIndexItem = indexItemProvider.findParentDocument(article, tenant);
    if (optCurrentIndexItem.isEmpty()) {
        // New article: add a fresh block to the index
        indexItemCreator.create(article, tenant);
    } else {
        // Existing article: replace its block
        DocumentWithLuceneId currentIndexItem = optCurrentIndexItem.get();
        indexItemUpdater.update(currentIndexItem, article, tenant);
    }
}
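For illustration, a block in this scheme is assembled children-first, parent-last, roughly like this (the field names and the Article/ArticleSection accessors are made up for the example):

List<Document> block = new ArrayList<>();
for (ArticleSection section : article.getSections()) {      // hypothetical accessor
    Document child = new Document();
    child.add(new StringField("type", "section", Field.Store.NO));
    child.add(new TextField("content", section.getText(), Field.Store.NO));
    block.add(child);                                        // children go in first
}
Document parent = new Document();
parent.add(new StringField("type", "article", Field.Store.NO));
parent.add(new StringField("articleId", article.getId(), Field.Store.YES));
block.add(parent);                                           // parent is the LAST document in the block
indexWriter.addDocuments(block);                             // the whole block is indexed atomically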
Where "update" is:
indexWriter.tryDeleteDocument
andindexWriter.addDocuments(newBlock);
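Spelled out, that update path roughly looks like this (the docID accessor, the reader variable and the articleId fallback term are illustrative, not the exact code):

long seqNo = indexWriter.tryDeleteDocument(reader, currentIndexItem.luceneDocId());
if (seqNo == -1) {
    // tryDeleteDocument is best-effort: it returns -1 if the segment holding
    // the docID has been merged away, so fall back to a delete-by-term
    indexWriter.deleteDocuments(new Term("articleId", article.getId()));
}
indexWriter.addDocuments(newBlock);   // re-add the whole block, children first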
IndexWriter instances, one per tenant, are kept in memory:
private final Map<String, IndexWriter> indexes = new HashMap<>();
public IndexWriter resolveIndex(final String tenant, final String indexName) {
    synchronized (indexes) {
        if (indexes.containsKey(tenant + indexName)) {
            return indexes.get(tenant + indexName);
        }
        try {
            final IndexWriter writer = new IndexWriter(directory(tenant), config());
            indexes.put(tenant + indexName, writer);
            return writer;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
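(The same cache could also be expressed with ConcurrentHashMap.computeIfAbsent, which keeps the create-once semantics without synchronizing every lookup; a sketch, not my actual code:)

private final ConcurrentMap<String, IndexWriter> indexes = new ConcurrentHashMap<>();

public IndexWriter resolveIndex(final String tenant, final String indexName) {
    return indexes.computeIfAbsent(tenant + indexName, key -> {
        try {
            return new IndexWriter(directory(tenant), config());
        } catch (IOException e) {
            throw new UncheckedIOException(e);   // computeIfAbsent cannot throw checked exceptions
        }
    });
}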
private IndexWriterConfig config() {
    TieredMergePolicy mergePolicy = new TieredMergePolicy();
    mergePolicy.setNoCFSRatio(1.0);   // always write compound files
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
    indexWriterConfig.setMergePolicy(mergePolicy);
    return indexWriterConfig;
}
commit() operations are done as a scheduled job, per tenant, in parallel, every 20 seconds (only for index writers that have pending changes). Locking per tenant (ReentrantLock) is in place.
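A minimal sketch of that commit job, assuming the writer cache above and a per-tenant ConcurrentMap of locks (the names are illustrative; the real job runs tenants in parallel):

@Scheduled(fixedDelay = 20_000)
public void commitPendingChanges() {
    indexes.forEach((key, writer) -> {
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            if (writer.hasUncommittedChanges()) {   // only commit writers with pending changes
                writer.commit();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            lock.unlock();
        }
    });
}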
The problem I have is that when importing articles from Kafka, after about 1 mln records, inode usage reaches 100% and disk space usage 93%.
At that moment, ~16GB of disk is used by the indexes. 6 indexes (tenants) have around ~300k files each, 14 have 10-75k files, 22 have 1-8k, etc. These are mostly .cfe, .cfs and .si files.
I set NoCFSRatio to 100% to always create compound files. Despite that, after 1 mln records I am running out of inodes. I played with different configs for TieredMergePolicy (like SegmentsPerTier(3), MaxMergeAtOnce(100), MaxMergedSegmentMB(2048)) to no avail.
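For completeness, this is how those variants plug into the config above (values as listed; the setters are the Lucene 9.x TieredMergePolicy API):

TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setNoCFSRatio(1.0);            // always write compound (.cfs/.cfe) files
mergePolicy.setSegmentsPerTier(3);         // fewer segments per tier => merge sooner
mergePolicy.setMaxMergeAtOnce(100);        // allow up to 100 segments in one merge
mergePolicy.setMaxMergedSegmentMB(2048);   // allow merged segments up to 2GB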
On a different machine, with more disk space, fewer tenants (about 10) and fewer articles to import (4.5 mln), the entire process works fine.
But I believe there is something I can do to improve the process so that it also works on the weaker machine. What can I do to deal with inode exhaustion during the index build-up? Any help is welcome.