Reputation: 28316
I have a PDF and want to extract the text contained in it. I've tried a few different PDF libraries and they all return basically the same results. When extracting the text from a two page document with literally hundreds of words, only a dozen or so words from the header are returned.
Is there any way to tell if the text I'm after is actually text or a raster image of the text? I'm thinking something along the lines of Firebug's "Inspect Element" but at this point I'll take any solution that tells what I'm really looking at.
This project really doesn't justify attempting to use OCR. And, although a simple solution, using fields in the PDF is not an option since the generator of the file is a third party.
Upvotes: 1
Views: 216
Reputation: 10418
Have you tried Amyuni PDF Creator .Net? It allows you to enumerate all components from a specified rectangular region of a page and inspect their type from a predefined types list. You could run a quick test using the trial version and the following code sample for text extraction:
// open a PDF file
axPDFCreactiveX1.Open(System.IO.Directory.GetCurrentDirectory()+"\\sampleBookmarks.pdf", "");
axPDFCreactiveX1.Refresh ();
String text = axPDFCreactiveX1.GetRawPageText (1);
MessageBox.Show (text);
Additionally, it provides Tesseract OCR integration in case you needed it.
Disclaimer: I am part of the development team of this product.
Upvotes: 1
Reputation: 15868
If Acrobat/Reader can select the text, then it Is Text.
Reasons your library might not be able to find the text in question:
If you can get away with copy/pasta from Reader, then just go that route.
Upvotes: 1
Reputation: 3699
Check this site out. It may contain some helpful code snippets. http://www.codeproject.com/KB/cs/PDFToText.aspx
Upvotes: 0