Reputation: 2312
I need to extract the text from a PDF that has already been processed by an OCR program. Can I use a normal PDFReader to get the text, or does an OCR-processed PDF require special handling?
Upvotes: 5
Views: 3074
Reputation: 49199
It depends on how it has been transformed. Many OCR apps put the text under the image in some way. Some do this by laying the text down first, then placing the image on top. Others place the image on the bottom, then lay the text on top using the "don't mark" (invisible) text rendering mode.
I mention this because I can't predict how any particular text extraction tool will respond to transparent text. In theory, it should just give you the text (this is what Acrobat does). Whether this happens in reality across all text extraction tools is anyone's guess.
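To make the "in theory" part concrete: invisible OCR text is still drawn with the ordinary text-showing operators in the page's content stream. Only the rendering mode differs (mode 3, set with the `Tr` operator, means neither fill nor stroke). Below is a toy sketch in Python, assuming an uncompressed content stream with plain literal strings. The function name `extract_text_runs` and the sample stream are made up for illustration. Real PDFs usually compress their streams and need font and encoding handling, so this is not a substitute for a real extraction library.

```python
import re

# Toy tokenizer for an *uncompressed* PDF content stream: either a literal
# string "(...)" (with backslash escapes) or a run of non-space characters.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[^\s()]+")

def extract_text_runs(content_stream):
    """Return (text, render_mode) pairs for every Tj operator found.

    Hypothetical helper for illustration only: it ignores TJ arrays,
    hex strings, fonts, and encodings.
    """
    runs = []
    render_mode = 0      # PDF default: fill the glyphs (visible text)
    operands = []
    for token in TOKEN.findall(content_stream):
        if token == "Tr" and operands:
            # Set text rendering mode; 3 means invisible text.
            render_mode = int(operands[-1])
            operands = []
        elif token == "Tj" and operands:
            literal = operands[-1]
            if literal.startswith("("):
                # Strip the parentheses and unescape \( \) \\ only.
                text = re.sub(r"\\([()\\])", r"\1", literal[1:-1])
                runs.append((text, render_mode))
            operands = []
        elif re.fullmatch(r"[A-Za-z'\"*]+", token):
            # Any other operator: discard its operands.
            operands = []
        else:
            operands.append(token)
    return runs

# Made-up page content: one visible run, then one invisible (mode 3) run.
sample = "BT /F1 12 Tf 72 720 Td (Visible) Tj 3 Tr 0 -20 Td (Hidden OCR) Tj ET"
for text, mode in extract_text_runs(sample):
    print(mode, text)
```

The invisible run comes out exactly like the visible one, which is why a well-behaved extractor should return it. A tool that filters on rendering mode, or that only looks at what is actually painted, could still skip it, so test your chosen tool on a sample file.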
Upvotes: 3
Reputation: 3183
There are a number of commercial SDKs for handling PDF files. Foxit's is one example: http://www.foxitsoftware.com/pdf/sdk/activex/
Upvotes: 0