Use PyPDF2 to detect Embedded Subset fonts in PDF

Question

I have modified the following script, which uses PyPDF2, to traverse a PDF and determine whether it contains unembedded fonts. It correctly lists all fonts in the PDF and identifies which of them are embedded. However, some PDFs contain fonts for which only the subset of glyphs actually used is embedded (see https://blogs.mtu.edu/gradschool/2010/04/27/how-to-determine-if-fonts-are-embedded/). How do you determine whether a font in a PDF is embedded as a subset? Thank you!

import sys
from pprint import pprint

from PyPDF2 import PdfFileReader
from PyPDF2.generic import ArrayObject

# Font descriptor keys that hold an embedded font program
fontkeys = set(['/FontFile', '/FontFile2', '/FontFile3'])

def walk(obj, fnt, emb):
    # Font dictionaries carry /BaseFont; font descriptors carry /FontName
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])

    elif '/FontName' in obj and fontkeys.intersection(set(obj)):
        emb.add(obj['/FontName'])

    for k in obj:
        child = obj[k]
        if hasattr(child, 'keys'):
            walk(child, fnt, emb)
        elif isinstance(child, ArrayObject):  # e.g. /DescendantFonts
            for i in child:
                i = i.getObject()  # array items may be indirect references
                if hasattr(i, 'keys'):
                    walk(i, fnt, emb)

    return fnt, emb

if __name__ == '__main__':
    fname = sys.argv[1]
    pdf = PdfFileReader(fname)
    fonts = set()
    embedded = set()

    for page in pdf.pages:
        obj = page.getObject()
        f, e = walk(obj['/Resources'], fonts, embedded)
        fonts = fonts.union(f)
        embedded = embedded.union(e)

    unembedded = fonts - embedded
    print('Font List')
    pprint(sorted(list(fonts)))
    if unembedded:
        print('\nUnembedded Fonts')
        pprint(unembedded)

KenS · Accepted Answer

By convention, the PostScript name of a subset font in a PDF file begins with a tag of the form XXXXXX+, where each 'X' is an uppercase ASCII letter.

See Section 5.5.3 (Font Subsets) of the PDF Reference Manual (version 1.7).
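
The naming convention can be tested with a regular expression. This is only a minimal sketch: it assumes the names are /BaseFont or /FontName values as PyPDF2 returns them (with a leading slash), and `is_subset_name` is a hypothetical helper, not part of PyPDF2.

```python
import re

# Six uppercase ASCII letters, a plus sign, then the real font name.
# The optional leading slash covers PDF name objects such as '/ABCDEF+Arial'.
SUBSET_TAG = re.compile(r'^/?[A-Z]{6}\+.+')

def is_subset_name(font_name):
    """Return True if font_name starts with a subset tag (hypothetical helper)."""
    return bool(SUBSET_TAG.match(str(font_name)))

print(is_subset_name('/ABCDEF+TimesNewRoman'))  # True
print(is_subset_name('/Helvetica'))             # False
```

In the script from the question, this check could be applied to each name collected in the `fonts` and `embedded` sets.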

Additionally, the presence of a CharSet or CIDSet entry in the font descriptor can indicate a subset font (both entries are optional).
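
A check for these descriptor entries might look like the sketch below. It assumes the descriptor is a dict-like object keyed by PDF names with a leading slash, as PyPDF2 dictionary objects are; `descriptor_marks_subset` is a hypothetical helper, and plain dicts stand in for real descriptors.

```python
# Optional font descriptor entries that hint at a subset font
SUBSET_KEYS = ('/CharSet', '/CIDSet')

def descriptor_marks_subset(descriptor):
    """Return True if the descriptor has /CharSet or /CIDSet (hypothetical helper)."""
    return any(key in descriptor for key in SUBSET_KEYS)

# Plain dicts stand in for font descriptor dictionaries here
with_charset = {'/FontName': '/ABCDEF+Arial', '/CharSet': '(/A/B/C)'}
without_markers = {'/FontName': '/Arial', '/FontFile2': None}

print(descriptor_marks_subset(with_charset))     # True
print(descriptor_marks_subset(without_markers))  # False
```

A result of False only means the optional entries are absent; it does not show that the font is fully embedded.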

However, all of these are conventions. There is no guaranteed way to be sure that a font showing none of these markers is not actually a subset font.
