Reputation: 83393
I want to programmatically create a PDF of an image I've OCR'ed. I want to make it selectable/searchable.
I know what and where each letter is. My thought was to create an invisible text letter at each location.
But can I somehow "connect" the letters so they can be selected, e.g. O-v-e-r-f-l-ow?
I thought of trying to guess the horizontal size of the letters and then writing the entire line, but fonts vary quite a bit in their widths (e.g. monospaced or not), so it might not match up.
I have seen selectable/searchable OCRed PDFs before, but I do not know how that can be implemented, or what PDF "feature" is used. How is this done?
Upvotes: 1
Views: 595
Reputation: 90315
To see how OCR text in PDF really works, see also this answer, over at SuperUser.com:
It's worthwhile to play with the command line tools, commands and instructions demonstrated in that answer, using an OCR'd PDF file you have around. You'll learn everything you need to solve your problem of writing "invisible" text.
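When you put text objects into a PDF, there are different modes available for rendering this text. I've copied the following table from the official PDF-1.7 specification:
Mode 0: Fill text.
Mode 1: Stroke text.
Mode 2: Fill, then stroke text.
Mode 3: Neither fill nor stroke text (invisible).
Mode 4: Fill text and add to path for clipping.
Mode 5: Stroke text and add to path for clipping.
Mode 6: Fill, then stroke text and add to path for clipping.
Mode 7: Add text to path for clipping.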
Now, guess, what OCR'ed text in PDFs uses?
Exactly, you are right: it uses Mode 3: Neither fill nor stroke text (invisible).
The PDF page drawing operator that sets the text rendering mode is Tr, and the code to switch to mode 3 is simply 3 Tr. It has to come before any text that you write (remember, PDF is like PostScript and uses postfix notation: first the value, then the operator).
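To make the operator order concrete, here is a minimal sketch in Python that builds a content-stream fragment for one run of invisible text. The function name and the font resource name F1 are assumptions for illustration; the page would still need a matching font entry in its resources, and the coordinates are in PDF user space.

```python
def invisible_text_stream(text, x, y, font_size=12, font_name="F1"):
    """Return a PDF content-stream fragment that places `text` invisibly.

    `font_name` is a hypothetical resource name; the page's /Resources
    dictionary must define it for the fragment to be valid in a real PDF.
    """
    # Escape the characters that are special inside a PDF literal string.
    escaped = (
        text.replace("\\", "\\\\")
            .replace("(", "\\(")
            .replace(")", "\\)")
    )
    return "\n".join([
        "BT",                             # begin text object
        f"/{font_name} {font_size} Tf",   # select font and size
        "3 Tr",                           # rendering mode 3: invisible
        f"{x} {y} Td",                    # move to the text position
        f"({escaped}) Tj",                # show the string
        "ET",                             # end text object
    ])


if __name__ == "__main__":
    print(invisible_text_stream("Overflow", 72, 700))
```

The fragment would typically be drawn on top of (or beneath) the scanned page image, so the viewer shows the image while selection and search operate on the invisible text.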
TL;DR: Whenever there is text rendered on a PDF page in mode 3, this text will be searchable, selectable and copy-able in any viewer, though it is invisible!
Upvotes: 2
Reputation: 159
If you simply write the characters out in order into the PDF, then most PDF readers will figure out the words on the fly, based on spacing, when someone selects or searches text.
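As a rough illustration of what "figuring out words based on spacing" can mean, here is a sketch that groups positioned letters on one line into words. It is not how any particular viewer does it; the Letter type, the gap_ratio threshold and its default value are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Letter:
    char: str
    x: float      # left edge of the letter box
    width: float  # width of the letter box


def group_into_words(letters, gap_ratio=0.3):
    """Split letters of one text line into words wherever the horizontal
    gap between neighbours exceeds gap_ratio times the wider letter."""
    words = []
    current = ""
    prev = None
    for letter in sorted(letters, key=lambda item: item.x):
        if prev is not None:
            gap = letter.x - (prev.x + prev.width)
            if gap > gap_ratio * max(prev.width, letter.width):
                words.append(current)
                current = ""
        current += letter.char
        prev = letter
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    line = [Letter("H", 0, 8), Letter("i", 10, 8),
            Letter("y", 30, 8), Letter("o", 40, 8)]
    print(group_into_words(line))
```

The same grouping could be run on the OCR side, so that each word is written as one string instead of relying on the viewer's heuristics.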
On the other hand, have you tried the latest tesseract-ocr? It has full PDF output now. I'm not sure whether the output is up to your standards with regard to text selection, but you might want to try it out at least.
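For trying this out from a script, a small sketch that calls the tesseract command line tool with its pdf output option. It assumes tesseract is installed and on the PATH; the helper names and the file names are made up for the example.

```python
import subprocess


def tesseract_pdf_command(image_path, output_base):
    """Build the command line: tesseract <image> <output base> pdf."""
    return ["tesseract", image_path, output_base, "pdf"]


def run_tesseract_pdf(image_path, output_base):
    """Run tesseract and return the path of the PDF it should write.

    Raises CalledProcessError if tesseract exits with an error.
    """
    subprocess.run(tesseract_pdf_command(image_path, output_base), check=True)
    return output_base + ".pdf"


if __name__ == "__main__":
    # Hypothetical input file; replace with a real scan.
    print(run_tesseract_pdf("scan.png", "scan"))
```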
Upvotes: 0