Woody Pride

Reputation: 13955

Reportlab PDF creating with python duplicating text

I am trying to automate the production of PDFs by reading data from a pandas DataFrame and writing it to a page of an existing PDF form using PyPDF2 and reportlab. The main part of the program is here:

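import io
from PyPDF2 import PdfFileReader
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

def pdfOperations(row, bp):
    # Draw this row's text onto an in-memory, single-page PDF
    packet = io.BytesIO()
    can = canvas.Canvas(packet, pagesize=letter)
    createText(row, can)
    packet.seek(0)
    new_pdf = PdfFileReader(packet)
    textPage = new_pdf.getPage(0)
    # Overlay the text page on the second page of the template
    frontPage = bp.getPage(0)
    secondPage = bp.getPage(1)
    secondPage.mergePage(textPage)
    assemblePDF(frontPage, secondPage, row)
    del packet, can, new_pdf, textPage, secondPage

def main():
    df = openData()
    bp = readPDF()  # template is read once, before the loop
    for ind in df.index:
        row = df.loc[ind]
        pdfOperations(row, bp)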

This works fine for the first row of data and the first PDF generated, but in the subsequent ones the new text is drawn on top of the old text, i.e. the second PDF contains the text from both the first and the second iteration. I thought garbage collection would take care of all the in-memory changes, but that does not seem to be happening. Does anyone know why?

I even tried forcing the objects to be deleted after the function has run its course, but no luck...

Upvotes: 0

Views: 481

Answers (1)

Jeronimo

Reputation: 2387

You read bp only once, before the loop. Then, in the loop, you obtain its second page via getPage(1) and merge the new text into it. But since that page always comes from the same object (bp), every iteration merges into the same page, so all the earlier merges add up.
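The accumulation can be illustrated without PyPDF2 at all. The sketch below uses two hypothetical stand-in classes, Page and Reader (they are not PyPDF2 classes), and assumes only what is described above: the reader hands out the same page object on every call, and merging changes that page in place.

```python
class Page:
    """Hypothetical stand-in for a PDF page; not a PyPDF2 class."""

    def __init__(self, text):
        self.content = [text]

    def merge(self, other):
        # Assumption: like mergePage, this modifies the receiving page in place.
        self.content.extend(other.content)


class Reader:
    """Hypothetical stand-in for a reader that keeps and returns its own pages."""

    def __init__(self):
        self._pages = [Page("front"), Page("form")]

    def get_page(self, index):
        return self._pages[index]


bp = Reader()  # created once, as in main()
results = []
for name in ["row 1", "row 2"]:
    second_page = bp.get_page(1)  # the same object on every iteration
    second_page.merge(Page(name))
    results.append(list(second_page.content))

print(results)
# [['form', 'row 1'], ['form', 'row 1', 'row 2']]
```

Deleting the local names with del does not help here, because the reader still holds a reference to the modified page.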

I couldn't find a way to create a "deepcopy" of a page in PyPDF2's docs, but simply creating a new bp object for each iteration should work.

Somewhere in readPDF you presumably open your template PDF as a binary stream and pass that to PdfFileReader. Instead, you could read the file's contents into a variable once:

with open(filename, "rb") as f:
    bp_bin = f.read()

And from that, create a new PdfFileReader instance for each loop iteration:

for ind in df.index:
    row = df.loc[ind]
    bp = PdfFileReader(io.BytesIO(bp_bin))  # PdfFileReader needs a file-like object, not raw bytes
    pdfOperations(row, bp)

This should "reset" the secondPage every time without any additional file I/O overhead. Only the parsing is repeated on each iteration, but depending on the file size and contents, that may be fast enough to live with.
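One detail worth checking is how the in-memory bytes reach the reader: a stream built with io.BytesIO from the same bytes is independent of any other such stream, so each iteration can start from an untouched copy. A minimal sketch using only the standard library, where the value of bp_bin is a placeholder assumed for illustration rather than a real PDF:

```python
import io

# Assumption: placeholder bytes standing in for the template read via f.read().
bp_bin = b"%PDF-1.4 placeholder, not a real PDF"

# Each BytesIO wraps the same data but keeps its own read position.
first = io.BytesIO(bp_bin)
second = io.BytesIO(bp_bin)

first.read(8)  # advance only the first stream

positions = (first.tell(), second.tell())
print(positions)  # (8, 0)
```

Because the original bytes object is never modified, it can be reused for as many rows as needed.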

Upvotes: 1
