I am attempting to build a Lucene index of about 5000 documents, and the index being created seems much too large. I would like to know if there is a way to reduce its size.
I am using Lucene 4.10, and the documents I want to index come in various formats (.docx, .xlsx, .pdf, .rtf, .txt). The directory containing the documents is about 1 GB in total. After indexing 3000 of the 5000 documents, the index is already 10 GB. I haven't found any helpful information on what a normal ratio of document size to index size would be, but a 10 GB index for 1 GB of source documents seems much too large.
To read in the documents, I am using the Tika 1.6 AutoDetectParser to generate a string containing the contents of each doc.
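(The DocReader class itself isn't shown below, but its extraction step boils down to something roughly like the following simplified sketch; the BodyContentHandler usage here is illustrative, not my exact code:)

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Simplified sketch of the Tika extraction step used by DocReader
public static String extractText(String path) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    // Pass -1 to lift BodyContentHandler's default 100,000-character write limit
    BodyContentHandler handler = new BodyContentHandler(-1);
    Metadata metadata = new Metadata();
    try (InputStream stream = new FileInputStream(path)) {
        parser.parse(stream, handler, metadata, new ParseContext());
    }
    return handler.toString();
}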
The following snippet shows how I am trying to build the index. After the index writer is created, it calls a method walkFiles() to traverse the document directory, reading in each document (using a "DocReader" class) and adding it to the index:
public void indexDocs() {
    docDir = "C:/MyDocDir";
    indexPath = "C:/DocIndex";
    docIndexDir = FSDirectory.open(new File(indexPath));
    analysis = new StopAnalyzer();
    iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analysis);
    iwc.setOpenMode(OpenMode.CREATE);
    docIndex = new IndexWriter(docIndexDir, iwc);
    addDoc = new Document();
    walkFiles(docDir);
    docIndex.close();
}

private void walkFiles(String docDir) {
    File docRoot = new File(docDir);
    File[] list = docRoot.listFiles();
    if (list == null) return;
    for (File f : list) {
        if (f.isDirectory()) {
            walkFiles(f.getAbsolutePath());
        } else {
            String docName = f.getAbsolutePath();
            DocReader readDoc = new DocReader(docName);
            if (readDoc.readFile()) {
                String docPath = readDoc.getPath();
                String docText = readDoc.getText();
                Field pathField = new StringField("path", docPath, Field.Store.NO);
                Field contentField = new TextField("contents", docText, Field.Store.NO);
                addDoc.add(pathField);
                addDoc.add(contentField);
                docIndex.addDocument(addDoc);
            }
        }
    }
}
Notice that I am using a StopAnalyzer and creating the contents field with Field.Store.NO, so the raw text should not be stored in the index. I can't find much other helpful information on reducing index size. I am also interested to know if anyone has real-world figures on how big an index is compared to the total size of the documents being indexed.
Upvotes: 0
Views: 2088
Reputation: 3941
I think you've got a coding problem rather than a Lucene issue.
You're creating a single document, "addDoc", and re-using it while adding all your files to the index. That's not necessarily a bad idea (although I probably wouldn't bother), but what you don't seem to be doing is clearing its fields before adding the next document. So each successive document you add contains all the fields of the preceding ones, meaning the amount of text indexed grows roughly quadratically with the number of files, which would explain your 10 GB index.
The simplest change to make would be to create a new document every time you read a file and add that to the index instead. Fingers crossed, the size of your index will plummet.
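Concretely, the inner block of your walkFiles() would become something like this (a sketch against the code in your question; only the Document handling changes):

String docPath = readDoc.getPath();
String docText = readDoc.getText();
// Build a fresh Document per file instead of re-using the shared addDoc
Document doc = new Document();
doc.add(new StringField("path", docPath, Field.Store.NO));
doc.add(new TextField("contents", docText, Field.Store.NO));
docIndex.addDocument(doc);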
Good luck,
Upvotes: 5