Reputation: 1

PDF itext TOC generation

I have to merge multiple PDF documents into a single PDF document and also generate a TOC for it. The original documents contain text in a specific style (say H1), and that text should become the TOC entries.

I have used iText to merge the PDF files, but I can't find an example or API for parsing a document to find all the content with the H1 style. Generating the TOC is the next challenge.

Upvotes: 0

Answers (1)

Mark Storer

Reputation: 15868

You don't. PDFs don't have styles. They have a "current graphics state", which includes:

- current transformation matrix (CTM)
- stroke & fill colors
- clipping path
- font & size
- gobs of other text state stuff (char spacing, word spacing, leading, text render mode...), including a separate text transformation matrix which is combined with the CTM

So first you have to track all this stuff (which iText can mostly do for you). Then you have to determine how big "H1" text is, and latch on to all the text that ends up that size on screen, taking the CTM, text matrix, and font size into account (which iText will do for you again, IIRC).
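To make the "taking the CTM, text matrix, and font size into account" part concrete, here is a rough sketch in plain Java of how the on-page size can be worked out by hand. It is not iText code: the class name `EffectiveTextSize` and the method `effectiveSize` are made up for illustration, and it assumes matrices are given as the six PDF numbers `[a b c d e f]`. It pushes a vertical vector of length "font size" through the text matrix and then the CTM and measures what comes out.

```java
final class EffectiveTextSize {
    // Transforms a direction vector (x, y) by a PDF matrix [a b c d e f].
    // PDF uses row vectors: [x y] * [[a b], [c d]]. The translation (e, f)
    // is ignored because it moves points, not directions.
    static double[] transformVector(double[] m, double x, double y) {
        return new double[] { x * m[0] + y * m[2], x * m[1] + y * m[3] };
    }

    // Hypothetical helper: height of the text after applying the text matrix
    // and then the CTM, in default user space units.
    static double effectiveSize(double fontSize, double[] textMatrix, double[] ctm) {
        double[] inTextSpace = transformVector(textMatrix, 0, fontSize);
        double[] onPage = transformVector(ctm, inTextSpace[0], inTextSpace[1]);
        // Use the vector length so rotated text still reports its real size.
        return Math.hypot(onPage[0], onPage[1]);
    }
}
```

So a 12 pt font drawn under a CTM that scales by 2 comes out as 24 units tall, and a "1 pt" font with a text matrix that scales by 18 is really 18 pt text. That is why the font size operand on its own tells you very little.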

And just to make life more exciting for folks like yourself, it's entirely possible that the text you're looking at isn't text at all. It could be paths, or a bitmap... at which point you need OCR, and I don't think you'll get much in the way of size info with OCR.

You'll need to write a TextRenderListener that determines the final size of a given piece of text (and whether or not it's part of the last piece) and filters out all the stuff that's too small. You'll then build your TOC based on the text you find.
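The filtering and "is this part of the last piece" logic might look something like the sketch below. Again this is plain Java rather than iText: `Chunk`, `TocEntry`, and `HeadingCollector` are hypothetical names, and a `Chunk` stands in for whatever your listener records for each piece of text (its string, its final on-page size, its baseline, and the page). It assumes chunks arrive in reading order and that two heading-sized chunks on the same page and baseline belong to the same heading.

```java
import java.util.ArrayList;
import java.util.List;

final class HeadingCollector {
    // Hypothetical stand-in for one piece of text reported by a render listener.
    static final class Chunk {
        final String text;
        final double size;      // final on-page size of the text
        final double baselineY; // vertical position of the baseline
        final int page;

        Chunk(String text, double size, double baselineY, int page) {
            this.text = text;
            this.size = size;
            this.baselineY = baselineY;
            this.page = page;
        }
    }

    static final class TocEntry {
        final String title;
        final int page;

        TocEntry(String title, int page) {
            this.title = title;
            this.page = page;
        }
    }

    // Keeps chunks at least minSize tall and joins consecutive ones that share
    // a page and (within tolerance) a baseline into a single TOC entry.
    static List<TocEntry> collect(List<Chunk> chunks, double minSize, double tolerance) {
        List<TocEntry> toc = new ArrayList<>();
        StringBuilder title = new StringBuilder();
        Chunk last = null; // previous heading-sized chunk, or null if none is open
        for (Chunk c : chunks) {
            boolean isHeading = c.size >= minSize;
            boolean continuesLast = isHeading && last != null
                    && c.page == last.page
                    && Math.abs(c.baselineY - last.baselineY) <= tolerance;
            if (!continuesLast && last != null) {
                // The open heading has ended, so emit it.
                toc.add(new TocEntry(title.toString(), last.page));
                title.setLength(0);
            }
            if (isHeading) {
                title.append(c.text);
                last = c;
            } else {
                last = null; // too small: filtered out
            }
        }
        if (last != null) {
            toc.add(new TocEntry(title.toString(), last.page));
        }
        return toc;
    }
}
```

This is deliberately simple: a heading that wraps onto a second line ends up as two entries, and chunks are concatenated as-is with no space inserted between them. Both would need tuning against your real documents.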

Upvotes: 0
