Reputation: 190
I am trying to convert a webpage to PDF using pdfkit. This works fine with a URL such as google.com, but when I try to convert a webpage built with NextJS, pdfkit keeps loading without any response.
I'm using imdb.com as an example because it also uses NextJS.
import pdfkit

try:
    options = {
        # 'page-size': 'A4',
        'encoding': 'utf-8',
        'margin-top': '0cm',
        'margin-bottom': '0cm',
        'margin-left': '0cm',
        'margin-right': '0cm',
        # 'image-quality': '1000',
        # 'image-dpi': '2000',
        'disable-smart-shrinking': '',
        'page-width': '595px',
        'page-height': '842px',
        'no-outline': None,
        'javascript-delay': '1000',
        "load-error-handling": "ignore"
    }
    pdfkit.from_url(
        'https://www.imdb.com/', 'out.pdf', options=options, verbose=True)
except Exception as e:
    raise e
When I run the above script, nothing happens, and there is no error output that I can use for debugging.
For debugging, I tried running wkhtmltopdf directly, and I still don't get any useful output.
My command:
$ wkhtmltopdf --javascript-delay 5000 --debug-javascript http://imdb.com out.pdf
The output: the loader freezes at 87% and nothing is printed that could help me figure out what is going wrong.
Loading pages (1/6)
[====================================================> ] 87%
Upvotes: 0
Views: 856
Reputation: 11811
You have two major issues: one is browser security, the other is printing a web page as PDF.
By far the simplest way to print a website without any secondary security issues is to use a browser, so that part is not a problem.
Chrome --headless [Disable$*] --run-all-compositor-stages-before-draw --no-pdf-header-footer --print-to-pdf="/folder/out.pdf" https://www.imdb.com
The main limitation is that the result will be exactly what you would get by printing the web page to PDF without adjusting any print settings. It should therefore be A4 portrait, so the media box should not be a problem, but setting margins is more difficult, especially as it is not designed to be command-line driven, and thus requires Puppeteer (or, for basic usage, something simpler such as SendKeys).
If you need commercial-grade adjustments, consider a commercial URL2PDF SDK/API solution; several are designed to work with Python.
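If you want to drive the headless command above from Python, a minimal sketch could look like this. It assumes a Chrome or Chromium executable is on PATH under one of the guessed names in CANDIDATES; the helper names build_print_command and print_to_pdf are made up for illustration. The flags are the ones from the command above.

```python
import shutil
import subprocess

# Guessed executable names (assumption); adjust for your system.
CANDIDATES = ("google-chrome", "google-chrome-stable", "chromium",
              "chromium-browser", "chrome")


def build_print_command(browser, url, pdf_path):
    """Return the argument list for a headless print-to-PDF run."""
    return [
        browser,
        "--headless",
        "--run-all-compositor-stages-before-draw",
        "--no-pdf-header-footer",
        f"--print-to-pdf={pdf_path}",
        url,
    ]


def print_to_pdf(url, pdf_path, timeout=60):
    """Run headless Chrome; raises if no browser is found or the run fails."""
    browser = next((p for p in map(shutil.which, CANDIDATES) if p), None)
    if browser is None:
        raise FileNotFoundError("no Chrome/Chromium executable found on PATH")
    subprocess.run(build_print_command(browser, url, pdf_path),
                   check=True, timeout=timeout)


# Example call (needs a browser and network access):
# print_to_pdf("https://www.imdb.com", "out.pdf")
```

The timeout is there so that a page that never finishes loading raises subprocess.TimeoutExpired instead of hanging silently.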
Upvotes: 0
Reputation: 276
You are not using the right package. pdfkit is good for capturing sites that are rendered on the server side. NextJS sites, like React sites in general, typically do not serve fully rendered HTML; much of the page is rendered on the client side, which is why the conversion waits forever. You can see the difference by fetching the HTML of the website with curl: you will see a lot of JavaScript that is rendered on the client side.
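The same check can be done from Python instead of curl. The following is only a rough sketch: the names ScriptCounter, summarize and fetch are hypothetical, and the User-Agent header is an assumption, since some sites reject requests that do not send one. It counts script tags and the amount of visible text in the raw HTML; many scripts and little text suggest that the page is built on the client side.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ScriptCounter(HTMLParser):
    """Count <script> tags and visible text characters in raw HTML."""

    def __init__(self):
        super().__init__()
        self.scripts = 0
        self.text_chars = 0
        self._skip = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.text_chars += len(data.strip())


def summarize(html):
    """Return (number of script tags, number of visible text characters)."""
    counter = ScriptCounter()
    counter.feed(html)
    counter.close()
    return counter.scripts, counter.text_chars


def fetch(url):
    """Download the raw HTML, as curl would, without running any JavaScript."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Example call (needs network access):
# print(summarize(fetch("https://www.imdb.com")))
```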
To capture such a site you need a library that uses headless Chrome; I recommend pyhtml2pdf. You need to:
pip install pyhtml2pdf
# Also install Chrome or Chromium if it is not already installed
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt -y install ./google-chrome-stable_current_amd64.deb
After that, you can use the simple code below to capture the website:
from pyhtml2pdf import converter
converter.convert('https://www.imdb.com', 'sample.pdf')
Works as expected and the website is rendered.
Upvotes: 0
Reputation: 1282
In your options object, try increasing the 'javascript-delay' value to 5000.
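If you want to try the longer delay without the script hanging indefinitely, one option is to call wkhtmltopdf directly with a time limit. This is just a sketch under assumptions: the helper names wkhtmltopdf_command and run_with_timeout are made up, wkhtmltopdf is assumed to be on PATH, and the 120 second limit in the example call is arbitrary. The flags are the same ones used in the question.

```python
import subprocess
import sys


def wkhtmltopdf_command(url, pdf_path, js_delay_ms=5000):
    """Build a wkhtmltopdf call with a longer JavaScript delay."""
    return [
        "wkhtmltopdf",
        "--javascript-delay", str(js_delay_ms),
        "--load-error-handling", "ignore",
        url,
        pdf_path,
    ]


def run_with_timeout(cmd, seconds):
    """Return the exit code, or None if the process was killed after `seconds`."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        print(f"gave up after {seconds}s: {cmd[0]}", file=sys.stderr)
        return None


# Example call (needs wkhtmltopdf and network access):
# run_with_timeout(wkhtmltopdf_command("https://www.imdb.com/", "out.pdf"), 120)
```

A return value of None means the conversion was still stuck when the limit was reached, which at least separates "slow" from "never finishes".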
Upvotes: 0