Reputation: 1
Currently I'm extracting the text of PDFs with the iTextSharp library (in VB.NET). I'd like to be independent of other tools / libraries, as I can't distribute them to others along with my program.
Is there a solution (no .dll etc) in any programming language to quickly extract the text of a PDF?
Upvotes: 0
Views: 8462
Reputation: 9022
Short answer:
Of course there is a way of doing this. iText (like many other PDF libraries) is capable of doing it, so an algorithm for extracting text clearly exists.
Long answer:
PDF is not a WYSIWYG format. A PDF document is sort of an ungodly marriage between "objects that reference each other" and "programming language".
Let me explain. A PDF document has a graphics state. So whenever you see text in a PDF document (in a viewer like Adobe Reader), you are essentially seeing the result of some 'code' in the PDF document that says:
Go to position 50, 720
Set the active font to Helvetica, fontsize 12
Set the active drawing color to black
draw the glyph that corresponds to the character 'H'
Go to position 53, 720
draw the glyph that corresponds to the character 'e'
etc
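To make the instruction list above concrete, here is a minimal Python sketch of what such 'code' can look like and how text might be pulled out of it. The content stream is hand-written for illustration, and the function name `extract_shown_text` is hypothetical. The sketch assumes an uncompressed stream, literal strings shown with the `Tj` operator only, and a simple one-byte font encoding. Real documents usually compress their streams and often use `TJ` arrays, hex strings, or custom font encodings, none of which is handled here.

```python
import re

# Hand-written, uncompressed content stream (an assumption for illustration).
CONTENT = rb"""
BT
/F1 12 Tf
0 g
50 720 Td
(Hello) Tj
0 -14 Td
(World) Tj
ET
"""

# A literal string "( ... )" followed by the Tj (show text) operator.
# Backslash escapes such as \( and \) are allowed inside the string.
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_shown_text(stream):
    """Return the strings shown by Tj operators, in stream order."""
    results = []
    for match in TJ_PATTERN.finditer(stream):
        raw = match.group(1)
        # Undo the escapes for parentheses and backslashes only.
        unescaped = re.sub(rb"\\([()\\])", rb"\1", raw)
        # Assumes each byte maps directly to one character (Latin-1).
        results.append(unescaped.decode("latin-1"))
    return results


print(extract_shown_text(CONTENT))
```

Note that the order of the strings in the stream is not guaranteed to match reading order on the page, since each one is positioned independently.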
Instructions and resources (like fonts, images, vector graphics) can be grouped together in objects.
Each object is assigned a number and is listed explicitly in the cross-reference table (near the end of the PDF document).
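As a rough illustration of how a reader finds that table, the sketch below looks for the `startxref` keyword near the end of the file and reads the byte offset that follows it. The function name `find_xref_offset` and the sample bytes are made up for this example. It only locates the offset; parsing the table itself (or a cross-reference stream, which newer PDFs may use instead) is a separate step not shown here.

```python
import re


def find_xref_offset(data):
    """Return the byte offset that follows the last 'startxref' keyword."""
    idx = data.rfind(b"startxref")
    if idx == -1:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", data[idx:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


# Hand-written file tail (an assumption, not taken from a real document).
SAMPLE_TAIL = b"trailer\n<< /Size 6 /Root 1 0 R >>\nstartxref\n1234\n%%EOF\n"

print(find_xref_offset(SAMPLE_TAIL))
```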
So, in order to read the text from a PDF document you would need to:
And that is probably why other people use libraries. Don't get me wrong, I'm a huge fan of doing it yourself (it's the best way to gain a deep understanding of how certain things work).
But look at it from the point of view of one of your users. What would you trust more?
Upvotes: 6