To set the stage, I am using pikepdf. Before extracting anything from a PDF, I first convert it to PDF/A using Ghostscript.
In PDF/A format, I can easily render it and see the text. The PDF is also a "True" PDF in the sense that everything is structured except for the actual text, which appears to be either an image object or some sort of unrecognized encoding.
The question is: using pikepdf, how do I determine whether the text is actually an image or, if it is not an image, find the element that explains how to interpret the text encoding in a PDF/A file?
For example, a typical line of a "True" PDF will be:
'[ (C) -0.169646 (O) 0.165508 (N) -0.169646 (T) 0.16137 (A) -0.169646 (C) -0.173783 (T) 0.16137 ] TJ'
# aka "CONTACT" when parsed.
However, when I inspect the user-entered data in the PDF, a typical line might be:
'[ <00240007> 1 <0067003a0063> 1.00301 <0013001300130013> ] TJ'
# where I have anonymized the numbers
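The hex strings in that line can at least be split into fixed-width character codes. Below is a minimal sketch, assuming each code is two bytes wide, as it would be for a composite font with a two-byte encoding (that is an assumption about this file, not something I have confirmed). `split_codes` is a hypothetical helper name, not part of pikepdf.

```python
def split_codes(hex_string, width=2):
    """Split a PDF hex string such as '<00240007>' into integer character codes.

    `width` is the assumed number of bytes per code (2 is a guess for this file).
    """
    # Drop the angle brackets and any whitespace inside the hex string.
    digits = "".join(hex_string.strip().strip("<>").split())
    # A PDF hex string with an odd number of digits is padded with a trailing 0.
    if len(digits) % 2:
        digits += "0"
    raw = bytes.fromhex(digits)
    # Read the bytes in big-endian groups of `width`.
    return [int.from_bytes(raw[i:i + width], "big") for i in range(0, len(raw), width)]


print(split_codes("<00240007>"))  # [36, 7]
```

These integers are only codes within the font; turning them into readable characters still needs whatever mapping the font provides.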
What I would like to do is decode the text, which is clearly visible when the page is rendered. But I am unsure where in the PDF to look for the encoding.
Is this information I can find in the PDF? And, if not, is there a way to determine what exactly these text snippets are? (e.g. pointers to image streams?)
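To make the question concrete, here is a minimal sketch of the kind of inspection I have in mind, assuming the relevant information sits in each page's `/Resources` dictionary: the `/Font` entries (with their `/Subtype`, `/Encoding` and `/ToUnicode` keys) and any image XObjects. The function names `summarize_font` and `inspect_pdf` are hypothetical, the sketch assumes a reasonably recent pikepdf, and it does not follow resources inherited from parent page-tree nodes.

```python
def summarize_font(font):
    """Report the entries of a font dictionary that govern text decoding.

    `font` is any mapping with .get(), for example a pikepdf font dictionary.
    """
    def as_text(value):
        # Names such as /Type0 or /Identity-H convert cleanly with str();
        # embedded CMap streams or encoding dictionaries may not, so flag them.
        if value is None:
            return None
        try:
            return str(value)
        except Exception:
            return "<embedded object>"

    return {
        "subtype": as_text(font.get("/Subtype")),
        "encoding": as_text(font.get("/Encoding")),
        "has_to_unicode": font.get("/ToUnicode") is not None,
    }


def inspect_pdf(path):
    """Print font and image information for every page of the file at `path`."""
    import pikepdf  # imported lazily so summarize_font also works on plain dicts

    with pikepdf.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            # Only the page's own /Resources is checked (no inheritance lookup).
            resources = page.obj.get("/Resources")
            fonts = resources.get("/Font") if resources is not None else None
            if fonts is not None:
                for name, font in fonts.items():
                    print(number, name, summarize_font(font))
            # Image XObjects referenced by the page, if any.
            print(number, "images:", list(page.images.keys()))
```

If a font reports a `/ToUnicode` entry, that stream would presumably be the place that maps the hex codes back to readable characters; if a page lists only images and no fonts, the visible text is more likely part of an image.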
Upvotes: 0
Views: 207