michael holstein

Reputation: 190

How to convert a NextJS website to pdf using python PDFkit

I am trying to convert a webpage to PDF using pdfkit. This works fine with a URL such as google.com, but when I try to convert a webpage built with Next.js, pdfkit keeps loading without any response.

I'm using imdb.com as an example because it is also built with Next.js.

import pdfkit

try:
    options = {
        # 'page-size': 'A4',
        'encoding': 'utf-8',
        'margin-top': '0cm',
        'margin-bottom': '0cm',
        'margin-left': '0cm',
        'margin-right': '0cm',
        # 'image-quality': '1000',
        # 'image-dpi': '2000',
        'disable-smart-shrinking': '',
        'page-width': '595px',
        'page-height': '842px',
        'no-outline': None,
        'javascript-delay': '1000',
        "load-error-handling": "ignore"

    }
    pdfkit.from_url(
        'https://www.imdb.com/', 'out.pdf', options=options, verbose=True)
except Exception as e:
    raise e

What I am trying to solve:

When I run the script above, nothing happens, and there is no error output that I could use for debugging.

Update

For debugging, I tried running wkhtmltopdf directly, and I still don't get any output.

My command:

$ wkhtmltopdf --javascript-delay 5000 --debug-javascript http://imdb.com out.pdf

The output: the progress bar freezes at 87% and nothing is printed that could help me figure out what is going wrong.

Loading pages (1/6)
[====================================================>       ] 87%
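Since the process hangs rather than failing, one way to at least get control back is to run the command under a timeout and capture whatever was written to stderr before it was killed. This is only a minimal sketch: the helper name run_with_timeout and the 60-second limit in the usage note are assumptions, not part of pdfkit or wkhtmltopdf.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (returncode, stderr).

    Returns (None, partial_stderr) if the process had to be killed
    because it exceeded timeout_s seconds.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        return proc.returncode, proc.stderr
    except subprocess.TimeoutExpired as exc:
        # The captured output on the exception may be bytes or None.
        partial = exc.stderr or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return None, partial
```

It could be called as run_with_timeout(["wkhtmltopdf", "--javascript-delay", "5000", "https://www.imdb.com/", "out.pdf"], 60); a return code of None then means the conversion was still stuck when the timeout expired.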

Upvotes: 0

Views: 856

Answers (3)

K J

Reputation: 11811

You have two major issues: one is browser security, the other is printing a web page as PDF.

By far the simplest way to print a website without any secondary security issues is to use a browser, so that referenced resources are not a problem.

Chrome --headless [Disable$*] --run-all-compositor-stages-before-draw --no-pdf-header-footer --print-to-pdf="/folder/out.pdf" https://www.imdb.com 
  • [Disable$*] stands for whatever list of --options you want, depending on your browser configuration


The main limitation is that the result will be exactly what you get when you print the page to PDF from the browser without adjusting any print settings. It should therefore come out as A4 portrait, so the media box should not be a problem, but setting margins is more difficult: the print dialog is not designed to be driven from the command line, so it requires Puppeteer (or, for basic usage, something simpler such as SendKeys).
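Since the question is about Python, the same command can be launched from a script. The following is only a rough sketch: the helper names and the default binary name google-chrome are assumptions (the executable may be chrome, chromium or a full path on your system), and the flags are simply the ones from the command above.

```python
import subprocess


def chrome_pdf_command(url, pdf_path, chrome_bin="google-chrome"):
    """Build the headless Chrome argument list for printing url to pdf_path."""
    return [
        chrome_bin,  # assumed binary name; adjust for your installation
        "--headless",
        "--run-all-compositor-stages-before-draw",
        "--no-pdf-header-footer",
        f"--print-to-pdf={pdf_path}",
        url,
    ]


def print_to_pdf(url, pdf_path, chrome_bin="google-chrome", timeout_s=120):
    """Run headless Chrome; raises if Chrome fails or exceeds timeout_s."""
    subprocess.run(
        chrome_pdf_command(url, pdf_path, chrome_bin),
        check=True,
        timeout=timeout_s,
    )
```

Calling print_to_pdf("https://www.imdb.com", "out.pdf") should then behave like the command line above, with the same default page size and margins.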

If you need commercial-grade adjustments, consider a commercial URL2PDF SDK/API solution; several are designed to work with Python.

Upvotes: 0

Claudiu T

Reputation: 276

You are not using the right package. pdfkit is good for capturing sites that are rendered on the server side. Next.js sites, like other React sites, often do not send fully rendered HTML from the server; much of the page is rendered on the client side, hence the endless wait. You can see the difference by fetching the site's HTML with curl: you will see a lot of JavaScript that is rendered on the client side.

For this you need a library that uses headless Chrome. I recommend pyhtml2pdf. You need to:
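The curl check can also be done from Python. The sketch below is only a rough heuristic, and the names PageStats, page_stats and fetch_html as well as the User-Agent value are my own assumptions: it counts the script tags in the raw HTML and the amount of text outside scripts and styles. Many script tags with very little text suggests the page depends on client-side rendering.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class PageStats(HTMLParser):
    """Count <script> tags and text characters outside <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.text_chars = 0
        self._skip = None  # tag whose content is currently being ignored

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip = tag

    def handle_endtag(self, tag):
        if tag == self._skip:
            self._skip = None

    def handle_data(self, data):
        if self._skip is None:
            self.text_chars += len(data.strip())


def page_stats(html):
    """Return (number of script tags, number of visible text characters)."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    return parser.script_tags, parser.text_chars


def fetch_html(url):
    """Download the raw HTML the server sends, without running JavaScript."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # assumed header
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, page_stats(fetch_html("https://www.imdb.com")) returns the two counts for the HTML as delivered by the server.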

pip install pyhtml2pdf
# also install Chrome or Chromium if it is not already installed
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt -y install ./google-chrome-stable_current_amd64.deb

Afterwards you can use the simple code below to capture the website:

from pyhtml2pdf import converter

converter.convert('https://www.imdb.com', 'sample.pdf')

Works as expected and the website is rendered.

Upvotes: 0

Jobajuba

Reputation: 1282

In your options object, try increasing the 'javascript-delay' value to 5000.

Upvotes: 0
