enamrik

Reputation: 2312

Using C# to Search OCR (searchable) PDF

I need to extract the text from a PDF that has already been processed by an OCR program. Can I use a normal PDFReader to get the text, or does an OCR-transformed PDF require special handling?

Upvotes: 5

Views: 3074

Answers (2)

plinth

Reputation: 49199

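It depends on how it has been transformed. Many OCR apps put the text under the image in some way. Some do this by laying the text down first, then placing the image on top. Others place the image on the bottom, then lay the text on top using the "don't mark" transfer mode (in PDF terms, typically text rendering mode 3, where glyphs are neither filled nor stroked, so the text is invisible but still present in the page content).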

I mention this because I can't predict how any particular text extraction tool will respond to transparent text. In theory, it should just give you the text (this is what Acrobat does). Whether this happens in reality across all text extraction tools is anyone's guess.
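As a starting point, here is a minimal sketch of the "just try a normal reader" approach, assuming the iTextSharp 5.x library is referenced (the question does not name a library, so that choice is an assumption). `OcrPdfText` and `ExtractAllText` are hypothetical names used only for illustration. It reads each page with the standard text extractor; whether the hidden OCR layer actually comes back is something to check against your own files.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of any library.
public static class OcrPdfText
{
    // Returns the text of every page, one page per block of lines.
    public static string ExtractAllText(string path)
    {
        var text = new StringBuilder();
        var reader = new PdfReader(path);
        try
        {
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text does not carry over.
                var strategy = new SimpleTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close();
        }
        return text.ToString();
    }
}
```

If a call such as `OcrPdfText.ExtractAllText("scan.pdf")` (the file name is a placeholder) returns only whitespace for a file you know is searchable, that suggests the tool is not picking up the OCR text layer and a different extractor is worth trying.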

Upvotes: 3

VoronoiPotato

Reputation: 3183

There are a number of commercial SDKs for handling PDF files. Here's Foxit's: http://www.foxitsoftware.com/pdf/sdk/activex/

Upvotes: 0
