How is this pdf encoded? The font looks funny

Question

I have seen this effect many times while reading PDF documents. Some PDFs have a funny, smudged font that looks like a scanned image. However, I am able to select the text, and while I select it, the highlighted text looks different, as seen in the images.

Default appearance

Appearance on selection of font

Overall, it seems like some OCR is happening behind the scenes. The document viewer I am using is Atril 1.12.2.

My question is: What is encoded in the pdf, image or text? What is happening to text when I am selecting it?

mkl · Accepted Answer

Another nice change can be observed in the document shared by the OP:

What we see here is indeed the result of OCR. But it is not OCR happening behind the scenes in the viewer; the OCR has already happened beforehand and its results have been integrated into the PDF.

The PDF page actually contains a scanned image upon which invisible text is drawn.

As long as nothing is selected, Atril shows exactly that: you only see the scanned image. As soon as you start selecting text, though, it appears to cover the marked area in blue and display the marked (formerly invisible) text in white upon it.

Therefore, where the invisible text is not placed exactly over the corresponding letters in the image, this can result in funny gaps like the one in the OP's screenshot after "multidimensional". Where the OCR output contains errors, one sees the erroneous text, as in my screenshots.

Other PDF viewers often merely mark the text by applying some effect to the text area, e.g. inverting colors or overlaying a semi-transparent color.

It might be considered an advantage of the Atril approach that, already while selecting, one sees the exact text one is selecting and will probably go on to copy.

Inside the content stream

As mentioned above, the PDF page actually contains a scanned image upon which invisible text is drawn.

In the page content stream the corresponding instructions look like this:

1 0 0 1 0 0.2401 cm

(shifting the coordinate system up by a tiny amount)

1 1 1 rg
1 i
/RelativeColorimetric ri
/R794 gs
0 0 576 719.5 re
f

(filling the image area with white)

q
576 0 0 719.5 0 0 cm
/Im0 Do
Q

(drawing the bitmap image)

1 0 0 1 0 -0.2401 cm

(shifting the coordinate system down by the same tiny amount, undoing the initial upshift)
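To see why the two tiny shifts cancel out and where the image ends up, here is a small Python sketch of the matrix arithmetic behind cm. It assumes the current transformation matrix (CTM) is the identity when the excerpt starts, which the excerpt itself does not show, and the helper names multiply and apply are made up for illustration.

```python
# PDF writes a matrix as six numbers [a, b, c, d, e, f], standing for
#   | a b 0 |
#   | c d 0 |
#   | e f 1 |
# and points are row vectors: [x y 1] x M.

IDENTITY = [1, 0, 0, 1, 0, 0]  # assumption: CTM at the start of the excerpt


def multiply(m, n):
    """Return the matrix product m x n in six-number form."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


def apply(m, x, y):
    """Map the point (x, y) through matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


# "cm" premultiplies its operand onto the current CTM.
ctm_up = multiply([1, 0, 0, 1, 0, 0.2401], IDENTITY)      # 1 0 0 1 0 0.2401 cm
image_ctm = multiply([576, 0, 0, 719.5, 0, 0], ctm_up)    # cm inside q ... Q
ctm_text = multiply([1, 0, 0, 1, 0, -0.2401], ctm_up)     # after Q, shift back down

# An image is painted into the unit square, so map two of its corners.
lower_left = apply(image_ctm, 0, 0)
upper_right = apply(image_ctm, 1, 1)

print("image lower left :", lower_left)
print("image upper right:", upper_right)
print("CTM for the text :", ctm_text)
```

Under that assumption the image covers the page area from about (0, 0.24) to (576, 719.74), and the text that follows is positioned in the original, unshifted coordinate system.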

BT

(beginning a text object)

0 0 0 rg

(setting the fill color to black)

/TT1 1 Tf
0.05 Tc
0 Tw
3 Tr

(selecting the font TT1 at size 1, a bit of extra space between characters, no extra space between words, and text rendering mode 3, i.e. invisible)

7.3 0 0 7.3 83.8 678.4401 Tm
(SOFTWARE-PRACTICE ) Tj

(setting the text matrix to translate by 83.8 horizontally and 678.4401 vertically and to scale by 7.3, then drawing some text)

0.08 Tc
7.4 0 0 7.1 175.2 678.4401 Tm
(AND ) Tj

(changing the character spacing a bit, setting the text matrix to translate by 175.2 horizontally and 678.4401 vertically and to scale by 7.4 horizontally and 7.1 vertically, then drawing some text)
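Because the font is selected at size 1, the actual size of the invisible glyphs comes entirely from the text matrix. A rough Python sketch of that arithmetic for the two Tm instructions above follows. It assumes an identity CTM at this point (the two tiny shifts cancel), the default horizontal scaling of 100% and no text rise, since the excerpt sets neither; the function name text_metrics is invented for this illustration.

```python
def text_metrics(font_size, char_spacing, tm):
    """Derive user-space sizes from Tf size, Tc and a Tm matrix [a, b, c, d, e, f].

    Assumes an unrotated matrix (b == c == 0), identity CTM,
    100% horizontal scaling and zero text rise.
    """
    a, b, c, d, e, f = tm
    return {
        "horizontal_size": font_size * a,        # glyph width scale in points
        "vertical_size": font_size * d,          # glyph height scale in points
        "char_spacing": char_spacing * a,        # Tc is given in unscaled text space
        "origin": (e, f),                        # where the first glyph starts
    }


# /TT1 1 Tf, 0.05 Tc, 7.3 0 0 7.3 83.8 678.4401 Tm
first = text_metrics(1, 0.05, [7.3, 0, 0, 7.3, 83.8, 678.4401])

# same font, 0.08 Tc, 7.4 0 0 7.1 175.2 678.4401 Tm
second = text_metrics(1, 0.08, [7.4, 0, 0, 7.1, 175.2, 678.4401])

for name, metrics in (("SOFTWARE-PRACTICE", first), ("AND", second)):
    print(name, metrics)
```

So the first chunk behaves like 7.3 pt text with roughly 0.37 pt of extra space per character, while the second is stretched slightly wider than tall. The OCR software appears to tune these values per chunk so that the invisible text lines up with the scanned letters underneath.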

...
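If you want to check a content stream for this kind of hidden OCR layer yourself, the idea is to track the text rendering mode and collect whatever is drawn while it is 3. The following Python sketch does that for an already decoded content stream given as a string. It is deliberately simplistic and only an illustration: it handles plain (...) Tj strings without nested parentheses, ignores TJ arrays and hex strings, and does not decompress anything. The function name invisible_text and the SAMPLE constant are made up here; SAMPLE just repeats the instructions quoted above.

```python
import re

# A literal string in parentheses (with backslash escapes), or any other
# run of non-space characters.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[^\s()]+")
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")

SAMPLE = """
1 0 0 1 0 0.2401 cm
1 1 1 rg
1 i
/RelativeColorimetric ri
/R794 gs
0 0 576 719.5 re
f
q
576 0 0 719.5 0 0 cm
/Im0 Do
Q
1 0 0 1 0 -0.2401 cm
BT
0 0 0 rg
/TT1 1 Tf
0.05 Tc
0 Tw
3 Tr
7.3 0 0 7.3 83.8 678.4401 Tm
(SOFTWARE-PRACTICE ) Tj
0.08 Tc
7.4 0 0 7.1 175.2 678.4401 Tm
(AND ) Tj
"""


def invisible_text(content):
    """Return the strings shown with Tj while the rendering mode is 3."""
    render_mode = 0   # PDF default: fill the glyphs
    saved_modes = []  # q/Q save and restore the graphics state, Tr included
    operands = []
    found = []
    for token in TOKEN.findall(content):
        is_operand = (
            token.startswith("(")
            or token.startswith("/")
            or NUMBER.fullmatch(token) is not None
        )
        if is_operand:
            operands.append(token)
            continue
        # Anything else is treated as an operator.
        if token == "q":
            saved_modes.append(render_mode)
        elif token == "Q" and saved_modes:
            render_mode = saved_modes.pop()
        elif token == "Tr" and operands:
            render_mode = int(operands[-1])
        elif token == "Tj" and operands and render_mode == 3:
            found.append(operands[-1][1:-1])  # strip the parentheses
        operands.clear()
    return found


print(invisible_text(SAMPLE))
```

For a real file you would first have to extract and decode the page's content stream with a PDF library, because content streams are usually stored compressed.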

TL;DR

What is encoded in the pdf, image or text?

Both: the image plus invisible text on top of it.

What is happening to text when I am selecting it?

Atril covers the text in blue and draws the selected (formerly invisible) text upon it in white.
