Ruediger Jungbeck

Reputation: 2964

Embedding documents in PDF files

We want to store some application specific metadata (a JSON object) within PDF documents that we create.

We tried using canvas.setKeywords to write it and PdfFileReader.documentInfo["/Keywords"] to read it back.

This works with a 100 KB file but appears to hang with a 1 MB file: documentInfo does eventually return, but it takes more than a minute.

Is there another way to embed a file into a PDF document with reportlab? Is there another way to read it back with PyPDF2?

Upvotes: 3

Views: 3149

Answers (2)

sspiff

Reputation: 41

A bit late, but I needed to embed data in a reportlab-created PDF as well, and eventually came up with the following. It stores the data as an EmbeddedFile stream in the PDF. To find the data later, it stores the stream's PDF object reference as a keyword. This is not "standard" (the PDF specification defines other ways of locating and naming an EmbeddedFile stream), but it works. The data is extracted using PyPDF2.

# embeds data in the given reportlab.pdfgen.canvas, addressed by key.
# returns a string that must be added to canvas as a keyword
#
def canvas_embed(canvas, key, data):
    from reportlab.pdfbase import pdfdoc
    # create a stream object to hold the embedded data
    s = pdfdoc.PDFStream(
        content=data,
        filters=[pdfdoc.PDFBase85Encode, pdfdoc.PDFZCompress])
    s.dictionary['Type'] = '/EmbeddedFile'
    # add it to the pdf
    r = canvas._doc.Reference(s)
    # return a string representing the object reference.
    # we just use the two reference components concatenated with
    # the given key name:
    return '{}:{:d}:{:d}'.format(key,
        *canvas._doc.idToObjectNumberAndVersion[r.name])

# extract the embedded file identified by key from
# the given PyPDF2.pdf.PdfFileReader
#
def reader_extract(pdfreader, key):
    from PyPDF2.generic import IndirectObject
    # find the key in the pdf's keywords (reportlab canvas
    # separates keywords with ', '), and split it to get
    # the object reference
    for k in pdfreader.documentInfo['/Keywords'].split(', '):
        if k.startswith(key + ':'):
            refn, refv = [int(x) for x in k.split(':')[1:]]
            break
    else:
        # no keyword matched, so there is no reference to follow
        raise KeyError(key)
    # fetch the stream data
    return IndirectObject(refn, refv, pdfreader).getObject().getData()

# a quick test
#
if __name__ == '__main__':
    from io import BytesIO
    from reportlab.pdfgen import canvas
    from PyPDF2 import PdfFileReader

    # a PDF is binary data, so use an in-memory bytes buffer
    pdfbuf = BytesIO()

    # create pdf with embedded data
    c = canvas.Canvas(pdfbuf)
    c.drawString(72.0, 72.0, 'embedded file test')

    embedkey = canvas_embed(
        canvas=c,
        key='myembeddeddata',
        data='some embedded data.')

    c.setKeywords(['SomeOtherKeyword', embedkey])

    c.showPage()
    c.save()
    pdfbuf.seek(0)

    # read embedded data from the pdf
    r = PdfFileReader(stream=pdfbuf)
    data = reader_extract(pdfreader=r, key='myembeddeddata')

    print('Found: {}'.format(data))
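
The keyword lookup is the part of this scheme that is easiest to get wrong, so here is a minimal, standard-library-only sketch of it in isolation. The helper names make_ref_keyword and find_ref are hypothetical (they are not part of reportlab or PyPDF2). They only mirror the 'key:objnum:generation' keyword format used above, and find_ref returns None when the key is absent.

```python
def make_ref_keyword(key, objnum, generation):
    """Format an object reference as 'key:objnum:generation'."""
    return '{}:{:d}:{:d}'.format(key, objnum, generation)


def find_ref(keywords, key):
    """Return (objnum, generation) for key, or None if it is absent.

    keywords is the raw '/Keywords' string, with entries joined by ', '.
    """
    prefix = key + ':'
    for entry in keywords.split(', '):
        if entry.startswith(prefix):
            # strip the prefix rather than splitting the whole entry,
            # so a key that itself contains ':' still parses
            objnum, generation = entry[len(prefix):].split(':')
            return int(objnum), int(generation)
    return None


if __name__ == '__main__':
    kw = ', '.join(['SomeOtherKeyword',
                    make_ref_keyword('myembeddeddata', 5, 0)])
    print(find_ref(kw, 'myembeddeddata'))
    print(find_ref(kw, 'missing'))
```

Returning None lets the caller decide how to handle a PDF that carries no embedded data. Note that the lookup still assumes no key or keyword contains ', '.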

Upvotes: 1

Patrick Maupin

Reputation: 8137

(This may or may not be a good enough answer, but I don't yet have the reputation to be able to comment...)

One possible reason for the long delay might be the encoding process for the string. If you don't mind reading the PDF back in, adding the data and writing it back out, you might try pdfrw. (Disclaimer: I am the pdfrw author.) The code to do this would look something like:

    from pdfrw import PdfReader, PdfWriter
    trailer = PdfReader('source.pdf')
    # my_json_string is the serialized metadata, e.g. the result of json.dumps(...)
    trailer.Info.Keywords = my_json_string
    PdfWriter().write('dest.pdf', trailer)

If that isn't fast enough because of the string encoding, you could actually store the data in a stream somewhere else in the file (and even compress it, if desired).
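
As a rough illustration of the compression step only (it does not use pdfrw's API), the JSON could be deflated with zlib before being placed in a stream. zlib output is the format that the PDF FlateDecode filter expects. The helper names pack_metadata and unpack_metadata are hypothetical, and the sample payload is made up.

```python
import json
import zlib


def pack_metadata(obj):
    """Serialize a JSON-compatible object and deflate it to bytes."""
    raw = json.dumps(obj, separators=(',', ':')).encode('utf-8')
    return zlib.compress(raw, 9)


def unpack_metadata(blob):
    """Inverse of pack_metadata: inflate the bytes and parse the JSON."""
    return json.loads(zlib.decompress(blob).decode('utf-8'))


if __name__ == '__main__':
    # made-up, repetitive payload standing in for application metadata
    meta = {'rows': [{'id': i, 'status': 'ok'} for i in range(1000)]}
    blob = pack_metadata(meta)
    print('raw bytes: {}'.format(len(json.dumps(meta))))
    print('compressed bytes: {}'.format(len(blob)))
    print('round trip ok: {}'.format(unpack_metadata(blob) == meta))
```

How much this saves depends entirely on the data. Repetitive JSON usually shrinks a lot, while already-compressed content will not.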

Upvotes: 2
