Parse Body Text from PDF

Question

I have recently been experimenting with parsing the text from a PDF document using iTextSharp in a VB2010 app. The document doesn't contain any images or other fancy elements, just text. I've read some articles and used some code snippets, and it looks promising. However, what I've been trying to do is parse out just the body of each page, minus the header and footer. I haven't found any guidance for that particular task.

I'm currently using the snippet found in "Reading PDF content with itextsharp dll in VB.NET or C#", but it parses all the text on a page. There's got to be a way to get just the body. Or at least I hope so.

Bobrovsky · Accepted Answer

PDFs generally do not contain information about the logical structure of the text they contain.

So there are no headers, footers, body, paragraphs or anything like that in a PDF. There is only a bunch of operations like "draw this glyph here" or "move to this position and draw that group of glyphs there". I wrote glyph and not character because PDFs are not required to contain readable text. Only the visual appearance is required to be specified.

One exception is Tagged PDF, but most PDFs in the wild are not tagged.

Given all of the above, you are probably left with the following approach:

1. Extract all text from each page.
2. Analyze the text and find similar parts at the beginning and end of each page.
3. Remove the similar parts.

This is heuristic-based detection, so it probably won't always give excellent results.
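
To make the approach above concrete, here is a minimal sketch in Python rather than VB.NET, so it stays independent of any particular PDF library. It assumes you have already extracted each page's text into a list of strings (for example with iTextSharp). The function names, the `max_lines` window and the `min_share` threshold are all my own illustrative choices, not part of any library. Digits are collapsed before comparing so that lines like "Page 3 of 10" and "Page 4 of 10" count as the same footer.

```python
import re
from collections import Counter


def _normalize(line):
    """Collapse whitespace and digits so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def strip_repeated_edges(pages, max_lines=3, min_share=0.6):
    """Remove leading/trailing lines that recur on many pages.

    pages     -- list of strings, one per page (already extracted text)
    max_lines -- how many lines at each edge may be header/footer (assumption)
    min_share -- fraction of pages a line must appear on to count as repeated
    Blank lines are dropped, so the returned bodies contain no empty lines.
    """
    split_pages = [[ln for ln in p.splitlines() if ln.strip()] for p in pages]

    # Count, per edge, on how many pages each normalized line shows up.
    head_counts, foot_counts = Counter(), Counter()
    for lines in split_pages:
        head_counts.update({_normalize(ln) for ln in lines[:max_lines]})
        foot_counts.update({_normalize(ln) for ln in lines[-max_lines:]})

    # Require at least two pages so a single page never strips itself.
    threshold = max(2, min_share * len(pages))

    bodies = []
    for lines in split_pages:
        start, end = 0, len(lines)
        # Walk down from the top while lines look like a repeated header.
        while (start < min(max_lines, end)
               and head_counts[_normalize(lines[start])] >= threshold):
            start += 1
        # Walk up from the bottom while lines look like a repeated footer.
        while (end > start and len(lines) - end < max_lines
               and foot_counts[_normalize(lines[end - 1])] >= threshold):
            end -= 1
        bodies.append("\n".join(lines[start:end]))
    return bodies
```

You would call `strip_repeated_edges(pages)` with one string per page and get back one string per page with the suspected header and footer lines removed. It can misfire, for instance on very short pages or when body text legitimately starts the same way on most pages, so treat the two parameters as knobs to tune against your own documents.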
