How does one extract the actual text from PDF lines with an unrecognized encoding?

Question

To set the stage, I am using pikepdf. Before extracting anything from the PDF, I first converted it to PDF/A using Ghostscript.

After conversion to PDF/A, the file renders fine and the text is visible. It is also a "true" PDF in the sense that everything is structured, except for the actual text, which appears to be either an image object or text in some encoding I do not recognize.

The question is: using pikepdf, how do I determine whether it is actually an image and, if it is not, find the element in the PDF/A file that explains how to interpret the text encoding?

For example, a typical line in the content stream of a "true" PDF looks like:

'[ (C) -0.169646 (O) 0.165508 (N) -0.169646 (T) 0.16137 (A) -0.169646 (C) -0.173783 (T) 0.16137 ] TJ'

# aka "CONTACT" when parsed.
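
For reference, a minimal sketch of the "parsing" meant here, assuming the font uses a simple single-byte encoding in which the bytes inside `( ... )` are plain ASCII. The helper name `text_from_tj_literal` is made up for illustration; it does not unescape sequences such as `\(` or octal codes, and it ignores the kerning numbers between the strings.

```python
import re


def text_from_tj_literal(line):
    """Join the literal (...) strings of a TJ array, ignoring kerning numbers."""
    # Each match is the content of one parenthesised string; escaped
    # characters are kept verbatim (no unescaping is attempted here).
    pieces = re.findall(r"\(((?:\\.|[^\\()])*)\)", line)
    return "".join(pieces)


line = ("[ (C) -0.169646 (O) 0.165508 (N) -0.169646 (T) 0.16137 "
        "(A) -0.169646 (C) -0.173783 (T) 0.16137 ] TJ")
print(text_from_tj_literal(line))  # CONTACT
```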

However, when I inspect the user-entered data in the PDF, a typical line looks like:

'[ <00240007> 1 <0067003a0063> 1.00301 <0013001300130013> ] TJ'

# where I have anonymized the numbers
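
To make the structure of such a line concrete, here is a rough sketch that splits the `<...>` hex strings into character codes. It assumes 2-byte codes, which is common for composite (Type0) fonts but is not confirmed for this file. The function names and the `to_unicode` mapping are hypothetical; a real mapping would have to come from the font itself, and the single entry below is invented purely to show the mechanism.

```python
import re


def codes_from_tj_hex(line, code_bytes=2):
    """Split the <...> hex strings of a TJ array into integer character codes."""
    codes = []
    for hex_run in re.findall(r"<([0-9A-Fa-f\s]*)>", line):
        digits = "".join(hex_run.split())
        if len(digits) % 2:
            digits += "0"  # an odd-length hex string is padded with a final 0
        raw = bytes.fromhex(digits)
        for i in range(0, len(raw), code_bytes):
            codes.append(int.from_bytes(raw[i:i + code_bytes], "big"))
    return codes


def decode_codes(codes, to_unicode, unknown="?"):
    """Map character codes to text using a code -> string table."""
    return "".join(to_unicode.get(code, unknown) for code in codes)


line = "[ <00240007> 1 <0067003a0063> 1.00301 <0013001300130013> ] TJ"
codes = codes_from_tj_hex(line)

# Invented mapping for illustration only; not taken from any real font.
to_unicode = {0x0013: "0"}
print(codes)
print(decode_codes(codes, to_unicode))
```

With only one (invented) table entry, every other code comes out as `?`; the point is that the hex strings are character codes that need a font-specific table, not readable text on their own.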

What I would like to do is decode this text, which is clearly visible when the PDF is rendered, but I am unsure where to look for the encoding in the PDF header.

Can this information be found in the PDF itself? If not, is there a way to determine what exactly these text snippets are (e.g. pointers to image streams)?
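
As a starting point for investigating this, the sketch below lists each page's fonts and image XObjects. It rests on two general PDF facts: strings shown with `TJ` are drawn with the current font (images are painted with the `Do` operator instead), and a font dictionary may carry `/Encoding` and `/ToUnicode` entries that describe how its codes map to characters. The function names and the path `input.pdf` are made up, and `inspect_pdf` assumes a reasonably recent pikepdf that provides `Page.resources` and `Page.images`.

```python
def summarize_font(font):
    """Pick out the font-dictionary entries that govern text decoding.

    `font` may be a pikepdf.Dictionary or a plain dict with the same keys.
    """
    return {
        "subtype": font.get("/Subtype"),
        "encoding": font.get("/Encoding"),
        "has_tounicode": "/ToUnicode" in font,
    }


def inspect_pdf(path):
    """Print a per-page summary of fonts and image XObjects."""
    import pikepdf  # third-party; imported here so summarize_font has no dependency

    with pikepdf.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            fonts = page.resources.get("/Font", {})
            for name, font in fonts.items():
                print(number, name, summarize_font(font))
            print(number, "image XObjects:", list(page.images.keys()))


# Example call (hypothetical file name):
# inspect_pdf("input.pdf")
```

If a font reports `has_tounicode` as true, its `/ToUnicode` stream is the natural place to look for the code-to-character table; if not, the text may not be recoverable without the embedded font program or OCR.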
