Reputation: 1469
I have modified the following script, which uses PyPDF2, to traverse a PDF and determine whether it contains unembedded fonts. It correctly lists all fonts in the PDF and identifies which of them are embedded. However, some PDFs contain fonts for which only the subset of glyphs actually used is embedded (see https://blogs.mtu.edu/gradschool/2010/04/27/how-to-determine-if-fonts-are-embedded/). How can I determine whether a font in a PDF is embedded only as a subset? Thank you!
import sys
from pprint import pprint

from PyPDF2 import PdfFileReader
from PyPDF2.generic import ArrayObject

# A font descriptor holding any of these keys carries an embedded font program.
fontkeys = set(['/FontFile', '/FontFile2', '/FontFile3'])


def walk(obj, fnt, emb):
    """Recursively collect used fonts (fnt) and embedded fonts (emb)."""
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])
    elif '/FontName' in obj and fontkeys.intersection(set(obj)):
        emb.add(obj['/FontName'])
    for k in obj:
        if hasattr(obj[k], 'keys'):
            walk(obj[k], fnt, emb)
        elif isinstance(obj[k], ArrayObject):  # You can also do ducktyping here
            for i in obj[k]:
                if hasattr(i, 'keys'):
                    walk(i, fnt, emb)
    return fnt, emb


if __name__ == '__main__':
    fname = sys.argv[1]
    pdf = PdfFileReader(fname)
    fonts = set()
    embedded = set()
    for page in pdf.pages:
        obj = page.getObject()
        f, e = walk(obj['/Resources'], fonts, embedded)
        fonts = fonts.union(f)
        embedded = embedded.union(e)
    unembedded = fonts - embedded
    print('Font List')
    pprint(sorted(list(fonts)))
    if unembedded:
        print('\nUnembedded Fonts')
        pprint(unembedded)
Upvotes: 0
Views: 1334
Reputation: 31159
By convention, the PostScript name of a subset font in a PDF file begins with a tag of the form XXXXXX+, where each 'X' is an uppercase ASCII letter.
See Section 5.5.3, "Font Subsets", of the PDF Reference (version 1.7).
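As a rough sketch of that naming convention, the check below looks for six uppercase letters followed by `+` at the start of a font name. The helper name `has_subset_tag` is made up for illustration; it accepts the name with or without the leading slash that PDF name objects carry.

```python
import re

# Six uppercase ASCII letters, then '+', optionally preceded by the '/'
# of a PDF name object (e.g. '/ABCDEF+Helvetica').
SUBSET_TAG = re.compile(r'^/?[A-Z]{6}\+')


def has_subset_tag(font_name):
    """Return True if font_name starts with a subset tag such as 'ABCDEF+'."""
    return bool(SUBSET_TAG.match(str(font_name)))
```

Applied to the question's script, something like `[f for f in embedded if has_subset_tag(f)]` would list the embedded fonts whose names follow the convention.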
Additionally, the presence of a /CharSet or /CIDSet entry in the font descriptor can indicate a subset font, although both entries are optional.
However, all of these are only conventions: there is no guaranteed way to be sure that a font exhibiting none of them is not actually a subset.
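These indications could be gathered per font descriptor along the following lines. This is only a sketch: `subset_evidence` is a hypothetical helper, and it assumes `descriptor` behaves like a plain dictionary with PDF-style keys such as `/FontName`. An empty result means "no evidence found", not "the full font is embedded".

```python
def subset_evidence(descriptor):
    """List the subset conventions that a font descriptor exhibits."""
    evidence = []
    name = str(descriptor.get('/FontName', '')).lstrip('/')
    tag = name[:6]
    # Naming convention: six uppercase ASCII letters followed by '+'.
    if name[6:7] == '+' and len(tag) == 6 and all('A' <= c <= 'Z' for c in tag):
        evidence.append('name tag')
    # Optional descriptor entries that are associated with subsets.
    for key in ('/CharSet', '/CIDSet'):
        if key in descriptor:
            evidence.append(key)
    return evidence
```

Returning a list rather than a boolean keeps the uncertainty visible: the caller can see which conventions were matched and decide how much weight to give them.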
Upvotes: 1