Reputation: 28316

How do I explore a PDF to determine if an element is text?

I have a PDF and want to extract the text contained in it. I've tried a few different PDF libraries and they all return basically the same results: when extracting the text from a two-page document with literally hundreds of words, only a dozen or so words from the header are returned.

Is there any way to tell if the text I'm after is actually text or a raster image of the text? I'm thinking something along the lines of Firebug's "Inspect Element" but at this point I'll take any solution that tells what I'm really looking at.

This project really doesn't justify attempting to use OCR. Using form fields in the PDF would be a simple solution, but that is not an option because the file is generated by a third party.

Upvotes: 1

Answers (3)

yms

Reputation: 10418

Have you tried Amyuni PDF Creator .Net? It allows you to enumerate all components from a specified rectangular region of a page and inspect their type from a predefined types list. You could run a quick test using the trial version and the following code sample for text extraction:

// Open a PDF file from the current directory
axPDFCreactiveX1.Open(System.IO.Directory.GetCurrentDirectory() + "\\sampleBookmarks.pdf", "");
axPDFCreactiveX1.Refresh();
// Read the raw text of page 1 and show it
string text = axPDFCreactiveX1.GetRawPageText(1);
MessageBox.Show(text);

Additionally, it provides Tesseract OCR integration in case you need it.

Disclaimer: I am part of the development team of this product.

Upvotes: 1

Mark Storer

Reputation: 15868

If Acrobat/Reader can select the text, then it Is Text.

Reasons your library might not be able to find the text in question:

Complex/bad fonts or encodings. Adobe can be very forgiving of garbage in, somehow managing to get Good Info out.
The text could be in an annotation rather than the page contents. It won't matter what program parses the content stream if you need to look in the page's annotation array (/Annots) instead.
You didn't name a particular library, so it's possible that the library you're using doesn't look inside XObject Forms. That's unlikely in an even remotely mature API, but stranger things have happened.
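
To see which of the cases above might apply to a given file, a rough byte-level survey can help before reaching for a full library. The sketch below is only a heuristic, not a real parser: it assumes an unencrypted PDF, only understands uncompressed or Flate-compressed streams, and scans the raw bytes instead of following the cross-reference table. The names survey_pdf and decode_stream and the keys of the returned dictionary are made up for this illustration. It counts image XObjects, form XObjects, /Annots entries, and streams that appear to contain text-showing operators (BT together with Tj or TJ).

```python
import re
import zlib

# Crude patterns for a raw byte scan; a real parser would follow the xref table.
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj\b(.*?)\bendobj\b", re.DOTALL)
STREAM_RE = re.compile(rb"(.*?)\bstream\r?\n(.*)endstream", re.DOTALL)
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")
FORM_RE = re.compile(rb"/Subtype\s*/Form\b")
ANNOTS_RE = re.compile(rb"/Annots\b")
OBJSTM_RE = re.compile(rb"/Type\s*/ObjStm\b")
# BT opens a text object; Tj and TJ are the common text-showing operators.
BT_RE = re.compile(rb"(?<![A-Za-z])BT(?![A-Za-z])")
SHOW_RE = re.compile(rb"(?<![A-Za-z])T[jJ](?![A-Za-z])")


def decode_stream(header, data):
    """Return stream bytes if they are unfiltered or Flate-decodable, else None."""
    if b"/Filter" not in header:
        return data
    if b"/FlateDecode" not in header:
        return None  # other filters (DCT, LZW, ...) are ignored in this sketch
    try:
        # decompressobj tolerates the line break that precedes "endstream"
        return zlib.decompressobj().decompress(data)
    except zlib.error:
        return None


def survey_pdf(pdf_bytes):
    """Count structures that hint at where a PDF keeps its visible text."""
    counts = {
        "image_xobjects": 0,
        "form_xobjects": 0,
        "annots_arrays": 0,
        "content_streams_with_text": 0,
        "form_streams_with_text": 0,
    }
    for obj in OBJ_RE.finditer(pdf_bytes):
        body = obj.group(1)
        match = STREAM_RE.match(body)
        header, data = (match.group(1), match.group(2)) if match else (body, None)
        if IMAGE_RE.search(header):
            counts["image_xobjects"] += 1
            continue  # image samples are not scanned for operators
        is_form = FORM_RE.search(header) is not None
        if is_form:
            counts["form_xobjects"] += 1
        if ANNOTS_RE.search(header):
            counts["annots_arrays"] += 1
        decoded = decode_stream(header, data) if data is not None else None
        if decoded is None:
            continue
        if OBJSTM_RE.search(header):
            # PDF 1.5+ files may pack dictionaries into compressed object streams.
            counts["image_xobjects"] += len(IMAGE_RE.findall(decoded))
            counts["form_xobjects"] += len(FORM_RE.findall(decoded))
            counts["annots_arrays"] += len(ANNOTS_RE.findall(decoded))
        elif BT_RE.search(decoded) and SHOW_RE.search(decoded):
            key = "form_streams_with_text" if is_form else "content_streams_with_text"
            counts[key] += 1
    return counts
```

Calling survey_pdf(open("document.pdf", "rb").read()) on the problem file (the file name is a placeholder) gives a rough picture. A couple of large image XObjects with almost no text-bearing streams suggests scanned pages; text operators that show up mostly in form streams suggest the library is skipping XObject Forms; /Annots entries mean the annotations are worth checking. Since this is pattern matching on bytes, treat the numbers as hints rather than proof.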

If you can get away with copy/pasta from Reader, then just go that route.

Upvotes: 1

DJ Quimby

Reputation: 3699

Check this site out. It may contain some helpful code snippets. http://www.codeproject.com/KB/cs/PDFToText.aspx

Upvotes: 0
