Soumya

Reputation: 1420

Efficient way to extract text from PDF for Lucene indexing

I am trying to extract the text content from a PDF file using Apache Tika and then pass the data to Lucene for indexing.

public static String extract(File file) throws IOException, SAXException, TikaException {
    // -1 removes BodyContentHandler's default 100,000 character write limit
    ContentHandler handler = new BodyContentHandler(-1);
    Metadata metadata = new Metadata();
    // try-with-resources so the stream is closed even if parsing fails
    try (InputStream input = new FileInputStream(file)) {
        new PDFParser().parse(input, handler, metadata, new ParseContext());
    }
    return handler.toString();
}

My query is related to the call

handler.toString();

We perform the extraction using multiple threads (4 to 8, configurable by the user). Is there another way to get a stream that we can feed to Lucene for indexing? My concern is that holding such huge Strings in memory will force us to use much bigger heaps.

Currently the indexing is done as:

doc.add(new TextField(fieldName, ExtractPdf.extract(file), Field.Store.NO));

We need to extract and index approximately 500K documents of varied sizes, ranging from 50KB to 50MB.

Upvotes: 1

Views: 1432

Answers (1)

Sabir Khan

Reputation: 10142

I haven't worked with Apache Tika before, but your question was interesting, so I looked around, and I don't see the call to toString() being the root cause of the problem.

As per my understanding, efficiency comes down to deciding whether you always need the FULL body text, regardless of its size, or whether your program logic can work fine with only a partial body of some fixed length N.

I am fairly sure that you will always need the full body text and that your program won't work with a partial body, so all the efficiency you can gain (assuming you always need the full text) is in breaking that large string down into chunks, as illustrated in the Tika examples under the section "Streaming the plain text in chunks" with a custom content handler decorator (see the sketch below). Memory-wise, your program still has to be able to hold that large body, but because it arrives in chunks, your downstream indexing step might become simpler.
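A minimal sketch of that idea, adapted to the PDFParser/BodyContentHandler setup from your question; the chunk size, the List<String> return type, and the method name extractInChunks are my own assumptions, not anything Tika prescribes:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.SAXException;

public static List<String> extractInChunks(File file)
        throws IOException, SAXException, TikaException {

    final int MAX_CHUNK_SIZE = 100_000;            // assumed chunk size in characters
    final List<String> chunks = new ArrayList<>();
    final StringBuilder current = new StringBuilder();

    // Decorator that accumulates body text and cuts it into fixed-size chunks
    ContentHandlerDecorator chunker = new ContentHandlerDecorator() {
        @Override
        public void characters(char[] ch, int start, int length) {
            current.append(ch, start, length);
            if (current.length() >= MAX_CHUNK_SIZE) {
                chunks.add(current.toString());
                current.setLength(0);
            }
        }
    };

    try (InputStream input = new FileInputStream(file)) {
        // BodyContentHandler forwards only the body text events to the decorator
        new PDFParser().parse(input, new BodyContentHandler(chunker),
                new Metadata(), new ParseContext());
    }
    if (current.length() > 0) {
        chunks.add(current.toString());            // flush the last partial chunk
    }
    return chunks;
}

Each entry in the returned list could then be added as its own TextField, or handed over to the indexing side through a queue as described further below.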

Your program should state its memory requirements based on the largest file size it supports, and with this approach you won't get any relief there. So deciding how large a file you are willing to handle is something you need to settle very early on.

The other option seems to be a process where you parse the same file multiple times in an incremental way, which wouldn't be very efficient either (just suggesting it as a possible approach; I am not sure if it is doable in Tika).

Ahhh....Lengthy write up :)

Having said all that, you should also note that you should try to decouple the file-parsing and indexing steps, so you can give each step its own tuning and configuration.

Either you can code a typical producer-consumer pattern using a thread-safe blocking queue (sketched below), or you can go with the Spring Batch API.
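A rough sketch of the queue-based variant, reusing the ExtractPdf.extract method from your question; the queue capacity, thread-pool sizes, class names, and the indexText helper are placeholders of mine, not a prescribed design:

import java.io.File;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParseIndexPipeline {

    // Text parsed by a producer, waiting to be indexed by a consumer
    static class ParsedDoc {
        final String fileName;
        final String text;
        ParsedDoc(String fileName, String text) { this.fileName = fileName; this.text = text; }
    }

    // Bounded queue so slow indexing applies back-pressure to fast parsing
    private final BlockingQueue<ParsedDoc> queue = new ArrayBlockingQueue<>(100); // assumed capacity

    public void run(List<File> pdfFiles) {
        ExecutorService parsers = Executors.newFixedThreadPool(4);  // 4-8 parser threads in your case
        ExecutorService indexers = Executors.newFixedThreadPool(2); // tuned independently

        // Producers: parse each PDF and hand the extracted text over
        for (File file : pdfFiles) {
            parsers.submit((Callable<Void>) () -> {
                queue.put(new ParsedDoc(file.getName(), ExtractPdf.extract(file)));
                return null;
            });
        }

        // Consumers: take parsed text off the queue and index it
        for (int i = 0; i < 2; i++) {
            indexers.submit((Callable<Void>) () -> {
                while (true) {
                    ParsedDoc doc = queue.take();
                    indexText(doc.fileName, doc.text); // hypothetical helper that adds to Lucene
                }
            });
        }

        parsers.shutdown();
        // ... after the producers finish, stop the consumers (e.g. with a poison-pill entry)
    }

    private void indexText(String fileName, String text) {
        // placeholder: build a Document and hand it to a shared IndexWriter (see the last sketch)
    }
}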

With Spring Batch, your reader would be responsible for reading and parsing the files and would pass a List of Strings to the processor; a List of List of Strings would then go to the writer, and the writer would simply index a few files in bulk as per your chunk-size configuration (a writer sketch follows below).
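To make the writer side concrete, here is a rough sketch of such a bulk-indexing writer (using the Spring Batch 4 style write(List) signature; the "content" field name and the one-document-per-file layout are my assumptions):

import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.springframework.batch.item.ItemWriter;

// Each item is one parsed file, represented as a list of text chunks
public class LuceneIndexingWriter implements ItemWriter<List<String>> {

    private final IndexWriter indexWriter;

    public LuceneIndexingWriter(IndexWriter indexWriter) {
        this.indexWriter = indexWriter;
    }

    @Override
    public void write(List<? extends List<String>> items) throws Exception {
        // One Lucene Document per file; multiple TextFields with the same name
        // are effectively concatenated at index time
        for (List<String> chunks : items) {
            Document doc = new Document();
            for (String chunk : chunks) {
                doc.add(new TextField("content", chunk, Field.Store.NO));
            }
            indexWriter.addDocument(doc);
        }
        indexWriter.commit(); // commit once per chunk of files, not per document
    }
}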

Decoupling is mandatory here because, as you should note, Lucene's IndexWriter is a thread-safe class, so you can employ multiple threads to index faster in addition to the multithreading you already use at the file-parsing level (see the sketch below).
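A minimal sketch of that shared-writer idea, which is roughly what the hypothetical indexText helper in the earlier sketch would do; the index directory, analyzer, and field names are assumptions for illustration:

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class SharedLuceneIndexer {

    // Opened once and shared by all indexing threads; IndexWriter is thread-safe
    private final IndexWriter writer;

    public SharedLuceneIndexer(String indexDir) throws IOException {
        this.writer = new IndexWriter(
                FSDirectory.open(Paths.get(indexDir)),
                new IndexWriterConfig(new StandardAnalyzer()));
    }

    // Safe to call concurrently from several indexer threads
    public void indexText(String fileName, String text) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("path", fileName, Field.Store.YES)); // assumed field for the file name
        doc.add(new TextField("content", text, Field.Store.NO));     // assumed field for the body text
        writer.addDocument(doc);
    }

    // Call once all documents have been indexed
    public void close() throws IOException {
        writer.commit();
        writer.close();
    }
}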

Hope it helps!!

Also, note that a String in Java is garbage collected like any normal object if it's not interned.

Upvotes: 1
