Reputation: 1
I have to merge multiple PDF documents into a single PDF document. Besides this, I have to generate TOC. The original documents will contain text with a specific style (say H1). This special text becomes part of TOC.
Have used iText for merging multiple PDF files. I am unable to find example/API on parsing the document to find all the contents having style H1. Generating TOC is next challenge.
Upvotes: 0
Views: 1111
Reputation: 15868
You don't. PDFs don't have styles. They have "current Graphic State", which includes:
So first you have to track all this stuff (which iText can mostly do for you). Then you have to determine how big "H1" text is, and latch on to all the text that is in that size screen size, taking the CTM, text matrix, and font size into account (which iText will do for you again, IIRC).
And just to make life more exciting for folks like yourself, it's entirely possible that the text you're looking at isn't text at all. It could be paths, or a bitmap... at which point you need OCR, and I don't think you'll get much in the way of size info with OCR.
You'll need to write a TextRenderListener
that determines the final size of a given piece of text (and whether or not its a part of the last piece) and filter out all the stuff that's too small. You'll then build your TOC based on the text you find.
Upvotes: 0