PDF parse area using Tika

Question

What I'm using: I'm using Apache Tika to parse a PDF in my Java application.

What I need: I need to parse a certain area (i.e. defined by a Rectangle object) of my PDF, as I usually did with iText.

Question: Is it possible to parse a defined area of my PDF using Apache Tika? How?

Gagravarr · Accepted Answer

Apache Tika will give you a simplified, normalised HTML representation of the document. For page-based formats, such as PDF or PPT, it will mark up the page boundaries, but for non-page-based formats (e.g. run-based .doc), it won't.

What you'll need to do is step down to Apache PDFBox, the underlying library that powers the PDF parser in Tika. Using PDFBox you can get the location of the objects on a given page, work out whether they're in the range you want, and get their text. It won't be quite as easy as using Apache Tika, but for that level of control you'll need to get more involved.
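The "work out whether they're in range" step can be sketched with the standard library alone. This is only an illustration: `RegionFilter` and `TextChunk` are hypothetical names, and the coordinates are assumed to have been collected already (in PDFBox they would typically come from a `PDFTextStripper` subclass). PDFBox also ships a `PDFTextStripperByArea` class that extracts text for a named rectangular region directly, which is probably the closest equivalent to the iText approach.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or PDFBox.
class RegionFilter {

    /** Hypothetical holder for a piece of text and the position of its origin on the page. */
    static final class TextChunk {
        final String text;
        final double x;
        final double y;

        TextChunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins, in input order, the text of every chunk whose origin lies inside the region. */
    static String textInRegion(List<TextChunk> chunks, Rectangle2D region) {
        List<String> kept = new ArrayList<>();
        for (TextChunk chunk : chunks) {
            // Rectangle2D.contains includes the left/top edges and excludes the right/bottom ones.
            if (region.contains(chunk.x, chunk.y)) {
                kept.add(chunk.text);
            }
        }
        return String.join(" ", kept);
    }
}
```

Calling `RegionFilter.textInRegion(chunks, new Rectangle2D.Double(50, 700, 200, 50))` would keep only the chunks whose origin falls inside that rectangle. Note that the coordinate origin (top-left or bottom-left of the page) depends on how the positions were obtained, so the rectangle has to be expressed in the same system.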
