Reputation: 4338
I'm using Apache Tika to parse documents and generate both a plaintext version and an HTML preview of each document. I'm able to generate both just fine if I call the parse function twice and pass in two separate ContentHandlers, which works great for text-only documents. But when I get documents that require OCR with Tesseract, it's a bit of a problem: calling the parse function twice is extremely wasteful, because the OCR (which can take a minute or so) runs twice as well.
I know I can write my own ContentHandler, but just wondering if anyone knows of an out-of-the-box solution for this? Much appreciated!
Upvotes: 1
Views: 227
Reputation: 48346
Good news - Apache Tika provides something out of the box for this!
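Just create your two or more real ContentHandlers, pass them to the constructor of TeeContentHandler, then hand the TeeContentHandler to Tika when you do the parse. The tee forwards every SAX event to each handler it wraps, so the document (OCR included) is only parsed once.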
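Here's a rough sketch of what that could look like. The handler choices are my assumption, since the question doesn't say which ones are in use: BodyContentHandler for the plaintext (with -1 to lift the default write limit) and ToHTMLContentHandler for the preview. The class name TeeParseSketch and the parseOnce helper are made up for illustration, and the code assumes tika-core plus the parser modules you need are on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of Tika.
class TeeParseSketch {

    // Parses the stream once and returns { plaintext, html }.
    static String[] parseOnce(InputStream in)
            throws IOException, SAXException, TikaException {
        // Assumed handler choices; swap in whichever handlers you already use.
        ContentHandler text = new BodyContentHandler(-1); // -1 = no write limit
        ContentHandler html = new ToHTMLContentHandler();

        // The tee fans each SAX event out to both handlers.
        ContentHandler tee = new TeeContentHandler(text, html);

        // Single parse call, so any OCR step runs only once.
        new AutoDetectParser().parse(in, tee, new Metadata(), new ParseContext());

        return new String[] { text.toString(), html.toString() };
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String[] out = parseOnce(in);
            System.out.println(out[0]); // plaintext version
            System.out.println(out[1]); // HTML preview
        }
    }
}
```

Both handlers see the same event stream, so if you need different behaviour per output (for example, skipping embedded documents in the preview only), that would have to be done inside the individual handlers rather than through the parser configuration.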
Upvotes: 3