Sebastian Vivten

Reputation: 103

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata

I'm currently looking for some details about PDF files. It is most important to me that the files remain usable for a very long time and, if possible, that OCR is applied automatically to new files (which does not really seem to be possible with Adobe Acrobat).

For that reason I've been looking at different solutions for OCRing my PDF files. I found three candidates which seem to do what they should (more or less). All three variants have their pros and cons, and they seem to take different approaches to storing data in PDF files. Let me explain:

But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":

View of "Remove hidden text" function in Adobe Acrobat DC Pro

I'm seriously confused. Does anyone know how these programs are storing their hidden text information really?

S.

P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/

Upvotes: 4

Views: 4450

Answers (1)

mkl

Reputation: 96064

Does anyone know how these programs are storing their hidden text information really?

You have correctly found out that the approach of ABBYY FineReader differs from that of Adobe Acrobat and of Tesseract:

  • ABBYY creates a page content stream in which the text is first drawn normally on the page and then covered by the scanned image.
  • Acrobat and Tesseract create content streams in which the image is drawn first and the text is then drawn invisibly (using text rendering mode 3, which neither fills nor strokes the glyphs and therefore draws nothing).
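
To make the two orderings concrete, here is a minimal sketch in Python that assembles simplified page content streams for both approaches. The resource names (Im0, F1), the coordinates, and the helper function names are assumptions chosen for illustration; the streams that the real programs write are more elaborate (per-word positioning, scaling, encodings), so treat this only as an outline of the operator order.

```python
def image_op(name, width, height):
    # Scale the unit square to width x height and paint the image XObject.
    return f"q {width} 0 0 {height} 0 0 cm /{name} Do Q"


def text_op(font, size, x, y, text, render_mode=0):
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # "Tr" sets the text rendering mode: 0 fills the glyphs (visible),
    # 3 neither fills nor strokes them (invisible).
    return f"BT {render_mode} Tr /{font} {size} Tf {x} {y} Td ({escaped}) Tj ET"


def text_under_image(image, width, height, font, size, x, y, text):
    # First approach: visible text is drawn first, then the scan covers it.
    return "\n".join([
        text_op(font, size, x, y, text, render_mode=0),
        image_op(image, width, height),
    ])


def invisible_text_over_image(image, width, height, font, size, x, y, text):
    # Second approach: the scan is drawn first, then text in rendering mode 3.
    return "\n".join([
        image_op(image, width, height),
        text_op(font, size, x, y, text, render_mode=3),
    ])


if __name__ == "__main__":
    # Hypothetical A4 page (595 x 842 points) with one recognized word.
    print(text_under_image("Im0", 595, 842, "F1", 10, 72, 700, "Hello"))
    print()
    print(invisible_text_over_image("Im0", 595, 842, "F1", 10, 72, 700, "Hello"))
```

In both cases a text extractor finds the same string; only the painting order and the rendering mode differ, which is why searching works for all three files.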

The difference between the latter two results is the choice of font used:

  • Acrobat uses regular standard 14 fonts, for which a PDF viewer has a font program to render them as normal glyphs.
  • Tesseract uses a font named GlyphLessFont, for which it embeds a font program into the result file. When rendered, the glyphs in this font do not show as our normal Latin glyphs but merely as empty space.
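
If you want to check which of the two font choices a given file uses, one rough way is to compare the BaseFont name of the text font against the names of the standard 14 fonts. The sketch below assumes you have already read the BaseFont value from the font dictionary with a PDF library of your choice; the function name is hypothetical, and a name match alone does not prove whether a font program is embedded.

```python
# The standard 14 font names defined by the PDF specification.
STANDARD_14 = frozenset({
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
})


def is_standard_14(base_font):
    # Accept the name with or without the leading slash of a PDF name object.
    return base_font.lstrip("/") in STANDARD_14


if __name__ == "__main__":
    for name in ("/Helvetica", "/GlyphLessFont"):
        kind = "standard 14 font" if is_standard_14(name) else "other font"
        print(f"{name}: {kind}")
```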

Considering the visual effect you observed for the ABBYY result, the approach used by Acrobat or Tesseract might be preferable.

Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste. They are used only in the invisible rendering mode anyway.

Upvotes: 10
