Chris

How does one extract the actual text from PDF text lines with an unrecognized encoding?

To set the stage, I am using pikepdf. Before extracting anything, I first convert the PDF to PDF/A using Ghostscript.

Once converted to PDF/A, the file renders without problems and the text is visible. It is also a "true" PDF in the sense that everything is structured except for the actual text, which appears to be either an image object or text in some unrecognized encoding.

The question is: using pikepdf on a PDF/A file, how do I determine whether the text is actually an image and, if it is not an image, how do I find the element that explains how to interpret the text encoding?

For example, a typical text line in a "true" PDF looks like this:

'[ (C) -0.169646 (O) 0.165508 (N) -0.169646 (T) 0.16137 (A) -0.169646 (C) -0.173783 (T) 0.16137 ] TJ'

# aka "CONTACT" when parsed.
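
As an illustration of how a line like the one above reduces to "CONTACT", here is a minimal sketch that joins the literal `(...)` strings of a TJ array and ignores the kerning numbers between them. The function name `tj_literal_text` is hypothetical, and the sketch assumes the font uses a simple single-byte encoding in which the bytes are already readable characters.

```python
import re

def tj_literal_text(tj_line):
    """Join the literal (...) strings of a TJ array, ignoring kerning numbers."""
    # Match (...) operands, allowing backslash escapes such as \( and \) inside.
    parts = re.findall(r"\(((?:\\.|[^\\()])*)\)", tj_line)
    # Undo only the simplest escapes: \( \) and \\
    return "".join(re.sub(r"\\([()\\])", r"\1", p) for p in parts)

line = ("[ (C) -0.169646 (O) 0.165508 (N) -0.169646 (T) 0.16137 "
        "(A) -0.169646 (C) -0.173783 (T) 0.16137 ] TJ")
print(tj_literal_text(line))
```

Octal escapes and nested unescaped parentheses are deliberately not handled here.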

However, when I inspect the user-entered data in this PDF, a typical line looks like this:

'[ <00240007> 1 <0067003a0063> 1.00301 <0013001300130013> ] TJ'

# (I have anonymized the numbers.)
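
For context, hex strings such as `<00240007>` in a TJ array are still text-showing operands. With composite (Type0) fonts they commonly hold multi-byte character codes, which map back to Unicode only through the font's `/ToUnicode` CMap when one is present. The sketch below shows that lookup under stated assumptions: two-byte codes and a CMap that uses only `bfchar` entries (no `bfrange`). `SAMPLE_CMAP`, `parse_bfchar`, and `decode_tj_hex` are hypothetical names, and the mapping itself is invented, since the real codes were anonymized.

```python
import re

# Invented ToUnicode CMap fragment, for illustration only. A real one is a
# stream referenced from the font dictionary's /ToUnicode key.
SAMPLE_CMAP = """
2 beginbfchar
<0024> <0041>
<0007> <0042>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Build {character code: unicode string} from bfchar sections only."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE encoded.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_tj_hex(tj_line, mapping, code_bytes=2):
    """Decode the <...> hex strings of a TJ array with a code-to-text mapping."""
    out = []
    for hex_str in re.findall(r"<([0-9A-Fa-f]+)>", tj_line):
        if len(hex_str) % 2:
            hex_str += "0"  # PDF pads an odd-length hex string with a zero
        raw = bytes.fromhex(hex_str)
        for i in range(0, len(raw), code_bytes):
            code = int.from_bytes(raw[i:i + code_bytes], "big")
            # Codes missing from the map become U+FFFD (replacement character).
            out.append(mapping.get(code, "\ufffd"))
    return "".join(out)

mapping = parse_bfchar(SAMPLE_CMAP)
print(decode_tj_hex("[ <00240007> 1 <0024FFFF> ] TJ", mapping))
```

If the font has no `/ToUnicode` entry, a lookup of this kind is not possible and the codes may only be glyph indices into the embedded font.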

I would like to decode this text, which is clearly visible when rendered, but I am unsure where to look for the encoding in the PDF header.

Is this information stored in the PDF? If not, is there a way to determine what exactly these text snippets are (e.g. pointers to image streams)?
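
One way to narrow this down: operands of `TJ` are always strings shown with the current font, while images are painted separately with the `Do` operator on an XObject whose `/Subtype` is `/Image`. Listing a page's resources therefore shows which fonts carry a `/ToUnicode` map and whether any image XObjects exist at all. The sketch below is written against any dict-like resources object, so it can be tried on plain dictionaries. The function name `summarize_resources` is hypothetical, and passing pikepdf's `page.resources` to it is an assumption that has not been checked against a specific pikepdf version.

```python
def summarize_resources(resources):
    """List fonts (with a /ToUnicode flag) and image XObjects in a resources dict."""
    fonts = {}
    for name, font in (resources.get("/Font") or {}).items():
        fonts[str(name)] = {
            "subtype": str(font.get("/Subtype")),
            # A /ToUnicode stream is what maps character codes back to text.
            "has_tounicode": "/ToUnicode" in font,
        }
    images = [
        str(name)
        for name, xobj in (resources.get("/XObject") or {}).items()
        if str(xobj.get("/Subtype")) == "/Image"
    ]
    return {"fonts": fonts, "images": images}

# Plain-dict stand-in for a page's /Resources dictionary (invented values).
example = {
    "/Font": {
        "/F1": {"/Subtype": "/Type0", "/ToUnicode": b"..."},
        "/F2": {"/Subtype": "/Type1"},
    },
    "/XObject": {
        "/Im0": {"/Subtype": "/Image"},
        "/Fm0": {"/Subtype": "/Form"},
    },
}
print(summarize_resources(example))
```

With pikepdf, the call would presumably look like `summarize_resources(pdf.pages[0].resources)` after `pikepdf.open(...)`. Resources inherited from a parent node or nested inside Form XObjects are not followed by this sketch.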
