Janusz Skonieczny

Reputation: 19030

How to generate PDF documents page-by-page in background tasks on App Engine

I need to generate PDF documents of 100+ pages. There is a lot of data to process, and all-at-once generation takes more time and memory than I can give.

I have tried a few different methods to hack my way through.

With varying results, I got it working, but it is slow and takes more memory than it should (sometimes hitting the instance soft memory limit). Currently I generate some sections in different tasks, storing each in the blobstore, and merge them with pyPdf, but it chokes on larger documents.

The document I'm generating is not that complicated: mostly tables and text, no internal references, no TOC, nothing that needs to be aware of the rest of the document. I can live with Platypus for layout, and I do not need a fancy document look or HTML2PDF conversion.

The goal is to generate the document as fast as the datastore will allow. Parallel page generation would be nice but is not required.

I was thinking of page-by-page generation with the blobstore Files API, where each task would generate a single page and the last task would finalize the blobstore file, making it readable. But I can't seem to find out how to pause generation, store the partial PDF to a stream, and then resume generation with that stream to generate the next page in a different task.

So my question is:

How, on GAE, can I generate a PDF document larger than a few pages, splitting the generation between task requests, and then store the resulting document in the blobstore?

If splitting the generation is not possible with ReportLab, then how can I minimize the footprint of merging different PDF documents so that it fits within the limits set for a GAE task request?

UPDATE: Alternatives to the Conversion API are much appreciated.

2nd UPDATE: The Conversion API is being decommissioned, so that's not an option now.

3rd UPDATE: Can the Pipeline or MapReduce APIs help here?

Upvotes: 12

Views: 1385

Answers (2)

user734094

Reputation:

I suggest installing wkhtmltopdf on App Engine. wkhtmltopdf is a command-line tool that renders HTML into PDF.

Create the HTML files and then convert them to PDF one by one using wkhtmltopdf.

On Windows you can use the following (on Linux-based systems it is similar):

import os

def create_pdf(in_html_file=None, out_pdf_file=None, quality=None):
    # Command template: executable, options, input HTML file, output PDF file.
    pathtowk = 'C:/wkhtmltopdf/bin/wkhtmltopdf.exe {0} {1} {2}'

    if quality == 1:  # best quality, no compression
        args_str = '--encoding utf-8 --disable-smart-shrinking --no-pdf-compression --page-size A4 --zoom 1 -q -T 15.24mm -L 25.4mm -B 20.32mm -R 33.02mm'
    elif quality == 2:  # moderate quality, some compression
        args_str = '--encoding utf-8 --disable-smart-shrinking --page-size A4 --zoom 1 -q -T 15.24mm -L 25.4mm -B 20.32mm -R 33.02mm'
    else:  # poor quality, maximum compression
        args_str = '--encoding utf-8 --page-size A4 --zoom 1 -q -T 15.24mm -L 25.4mm -B 20.32mm -R 33.02mm'

    # Run wkhtmltopdf through the shell with the chosen options.
    os.system(pathtowk.format(args_str, in_html_file, out_pdf_file))

Alternatively, you can use subprocess.call(pathtowk.format(args_str, in_html_file, out_pdf_file)) to execute wkhtmltopdf (it is better, in my opinion).
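A minimal sketch of the subprocess variant, passing the command as an argument list so that paths containing spaces need no shell quoting. The options are the ones used above; the `build_command` helper and the assumption that a `wkhtmltopdf` binary is on the PATH are mine, not part of the original code.

```python
import subprocess

# Assumption: the wkhtmltopdf binary is reachable on PATH; adjust as needed.
WKHTMLTOPDF = "wkhtmltopdf"

# Options shared by every quality level (taken from the snippet above).
COMMON_ARGS = ["--page-size", "A4", "--zoom", "1", "-q",
               "-T", "15.24mm", "-L", "25.4mm", "-B", "20.32mm", "-R", "33.02mm"]


def build_command(in_html_file, out_pdf_file, quality=None, executable=WKHTMLTOPDF):
    """Return the wkhtmltopdf command as a list of arguments (hypothetical helper)."""
    args = [executable, "--encoding", "utf-8"]
    if quality in (1, 2):
        # Qualities 1 and 2 both turn off smart shrinking.
        args.append("--disable-smart-shrinking")
    if quality == 1:
        # Only the best quality level turns off PDF compression.
        args.append("--no-pdf-compression")
    return args + COMMON_ARGS + [in_html_file, out_pdf_file]


def create_pdf(in_html_file, out_pdf_file, quality=None):
    # check_call raises CalledProcessError if wkhtmltopdf exits with an error.
    subprocess.check_call(build_command(in_html_file, out_pdf_file, quality))
```

Calling `create_pdf("page1.html", "page1.pdf", quality=2)` would then convert a single file; the file names here are placeholders.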

When the conversion process is complete, use PyPDF2 to merge the generated PDFs into a single file.
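The merge step could look roughly like the sketch below. It assumes a PyPDF2 release that provides `PdfMerger` (older releases call the same class `PdfFileMerger`); the `merge_pdfs` function and its `merger_factory` parameter are hypothetical names introduced for illustration. Nothing here addresses the memory footprint of merging large documents, so treat it as a starting point only.

```python
def merge_pdfs(inputs, output_stream, merger_factory=None):
    """Append each input PDF in order and write the result to output_stream.

    inputs: paths or binary file-like objects, one per generated PDF.
    output_stream: any object with a write() method opened for binary data.
    merger_factory: hypothetical hook for supplying a different merger class.
    """
    if merger_factory is None:
        # Assumption: PyPDF2 with PdfMerger is installed; on older
        # releases import PdfFileMerger instead.
        from PyPDF2 import PdfMerger
        merger_factory = PdfMerger

    merger = merger_factory()
    try:
        for source in inputs:
            merger.append(source)
        merger.write(output_stream)
    finally:
        # Release any file handles the merger opened.
        merger.close()
```

Writing to a stream rather than a file path keeps the function usable where there is no writable local disk, for example when the output is a blobstore file object.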

Upvotes: 1

Peter Knego

Reputation: 80340

Take a look at the new Conversion API: https://developers.google.com/appengine/docs/python/conversion/overview

Upvotes: 1
