Gaurav Saini

Reputation: 81

How to generate screenshots for pdf pages using puppeteer and Node js

I am creating a screenshot generator using Puppeteer and Node.js. It works fine for normal web pages, but for PDF pages it throws the same error every time I run it.

Here's the code (based on the first example from https://github.com/GoogleChrome/puppeteer):

const puppeteer = require('puppeteer');

(async () => {
    try {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf');
        await page.screenshot({ path: 'example.png' });
        await browser.close();
    } catch (err) {
        console.log(err);
    }
})();

The error that I get

Error: net::ERR_ABORTED at https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf
    at navigate (C:\MEAN\puppeteer-demo\node_modules\puppeteer\lib\FrameManager.js:121:37)
    at process._tickCallback (internal/process/next_tick.js:68:7)
  -- ASYNC --
    at Frame.<anonymous> (C:\MEAN\puppeteer-demo\node_modules\puppeteer\lib\helper.js:110:27)
    at Page.goto (C:\MEAN\puppeteer-demo\node_modules\puppeteer\lib\Page.js:629:49)
    at Page.<anonymous> (C:\MEAN\puppeteer-demo\node_modules\puppeteer\lib\helper.js:111:23)
    at C:\MEAN\puppeteer-demo\index.js:7:20
    at process._tickCallback (internal/process/next_tick.js:68:7)

Any help is appreciated. I'm also open to any other possible solutions.

Upvotes: 8

Views: 6665

Answers (4)

Sam Sussman

Reputation: 1045

As @kalana-perera mentioned, @aaditya-chakravarty's solution produced a low-resolution, stretched image. I made some modifications to output a full, undistorted image of the PDF's first page.

This uses TypeScript with the latest version of PDF.js.

import puppeteer from "puppeteer";

async function generatePdfPreview(pdfUrl: string) {
  const browser = await puppeteer.launch({
    headless: "new",
    defaultViewport: null,
    args: [
      "--no-sandbox",
      "--disable-setuid-sandbox",
      "--disable-web-security",
      "--disable-features=IsolateOrigins",
      "--disable-site-isolation-trials",
    ],
  });
  const page = await browser.newPage();
  await page.setContent(
    previewCreatorPage(pdfUrl)
  );
  await page.waitForSelector("#renderingComplete");
  await page.waitForNetworkIdle();
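  const pdfPage = await page.$("#page");
  // Await the screenshot so it finishes before the browser is closed
  const screenshot = await pdfPage!.screenshot({
    type: "png",
    omitBackground: true,
  });
  // Close the browser so the Chromium process does not keep running
  await browser.close();

  return screenshot;
}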

function previewCreatorPage(url: string) {
  return `<html lang="en">

  <head>
      <meta charset="UTF-8">
      <meta http-equiv="X-UA-Compatible" content="IE=edge">
      <meta name="viewport" content="width=device-width, initial-scale=1.0">
  
      <style>
          body {
              width: 100vw;
              height: 100vh;
              margin: 0px;
          }
          #page {
              display: flex;
              width: 100%;
          }
      </style>
  
      <title>Document</title>
  </head>
  
  <body>
      <canvas id="page"></canvas>
      <script src="https://mozilla.github.io/pdf.js/build/pdf.js"></script>
      <script>
          var pdfjsLib = window['pdfjs-dist/build/pdf'];
          (async () => {
              const pdf = await pdfjsLib.getDocument('${url}').promise;
              const page = await pdf.getPage(1);
  
              const viewport = page.getViewport({ scale: 1 });
          
              const canvas = document.getElementById('page');
              const context = canvas.getContext('2d');
  
              canvas.height = viewport.height;
              canvas.width = viewport.width;
  
              const renderContext = {
                  canvasContext: context,
                  viewport: viewport
              };
  
            await page.render(renderContext).promise;

            const completeElement = document.createElement("span");
            completeElement.id = 'renderingComplete';
            document.body.append(completeElement);
          })();
      </script>
  </body>

  </html>
  `;
}
  • defaultViewport: null allows images larger than the default 800x600 viewport.
  • Kept width: 100% on the canvas and removed height: 100%, so the page is no longer stretched.
  • Uses the latest pdf.js (3?).
  • Screenshots just the canvas (#page) instead of the whole page.

Edit:

  • updated with @terraloader's solution to improve timing
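
If the output is still too small, one option is to render the page at a larger scale instead of scale: 1. The helper below is only a sketch: scaleForWidth and the 1600-pixel target are hypothetical names and values, not part of the answer above. It computes the scale that makes the page a given number of pixels wide while keeping the aspect ratio.

```javascript
// Compute the PDF.js scale (and resulting canvas size) for a target pixel width.
// pageWidth and pageHeight are the viewport dimensions at scale 1.
function scaleForWidth(pageWidth, pageHeight, targetWidth) {
  const scale = targetWidth / pageWidth;
  return {
    scale,
    width: Math.round(pageWidth * scale),
    height: Math.round(pageHeight * scale),
  };
}

// Possible use inside the page script (same PDF.js calls as in the template above):
//   const base = page.getViewport({ scale: 1 });
//   const { scale } = scaleForWidth(base.width, base.height, 1600); // 1600 is an arbitrary target
//   const viewport = page.getViewport({ scale });
```

Because width and height are scaled by the same factor, the result stays undistorted.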

Upvotes: 3

Aaditya Chakravarty

Reputation: 123

For anyone stumbling on this question now: I did it with a combination of Puppeteer, EJS and PDF.js, since Puppeteer by itself cannot display PDF files.

My approach uses EJS to inject the PDF's URL into a template, PDF.js to render the PDF onto a canvas in that page, and Puppeteer to take a screenshot of the result.

Here's the JS part:

const ejs = require('ejs');
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ 
        args: [
            '--disable-web-security',
            '--disable-features=IsolateOrigins',
            '--disable-site-isolation-trials'
        ]
    });
    const page = await browser.newPage();

    const url = "https://example.com/test.pdf";

    const html = await ejs.renderFile('./template.ejs', { data: { url } });

    await page.setContent(html);
    await page.waitForNetworkIdle();
    const image = await page.screenshot({ encoding: 'base64' });

    await browser.close();

    console.log('Image: ', image);
})();

I added Chromium args to puppeteer.launch so the PDF file can be loaded without CORS restrictions, as per this answer.

Here's the EJS template:
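
A possible alternative to disabling web security, sketched here under assumptions rather than tested against this setup: download the PDF in Node, embed it in the template as base64, and hand the bytes to PDF.js through getDocument({ data }). The names encodePdf, decodePdf and data.base64 are hypothetical. Whether the launch flags can then be dropped depends on your environment.

```javascript
// Node side: encode the downloaded PDF so it can be embedded in the template.
function encodePdf(buffer) {
  return buffer.toString('base64');
}

// Browser side: decode the base64 string back into raw bytes for PDF.js.
function decodePdf(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Possible use in the template (data.base64 is a hypothetical template variable):
//   pdfjsLib.getDocument({ data: decodePdf('<%= data.base64 %>') })
```

Note that large PDFs make the generated HTML correspondingly large, so this suits small documents best.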

<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">

    <style>
        body {
            width: 100vw;
            height: 100vh;
            margin: 0;
        }
        #page {
            display: flex;
            width: 100%;
            height: 100%;
        }
    </style>

    <title>Document</title>
</head>

<body>
    <canvas id="page"></canvas>
    <script src="https://unpkg.com/[email protected]/build/pdf.min.js"></script>
    <script>
        (async () => {
            const pdf = await pdfjsLib.getDocument('<%= data.url %>');
            const page = await pdf.getPage(1);

            const viewport = page.getViewport(1);
        
            const canvas = document.getElementById('page');
            const context = canvas.getContext('2d');

            canvas.height = viewport.height;
            canvas.width = viewport.width;

            const renderContext = {
                canvasContext: context,
                viewport: viewport
            };

            page.render(renderContext);
        })();
    </script>
</body>

</html>

Note that this code takes a screenshot of only the first page.
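
To cover more than the first page, the page script could loop over pdf.numPages and render each page to its own canvas. This is a rough sketch: renderAllPages and the createCanvas callback are hypothetical, and it assumes a PDF.js version where getViewport takes an options object and render returns a task with a promise property.

```javascript
// Render every page of a loaded PDF.js document, one canvas per page.
// createCanvas is a hypothetical callback that returns a canvas for page n,
// e.g. (n) => document.body.appendChild(document.createElement('canvas')).
async function renderAllPages(pdf, createCanvas, scale = 1) {
  const canvases = [];
  for (let n = 1; n <= pdf.numPages; n++) {
    const page = await pdf.getPage(n); // PDF.js pages are 1-indexed
    const viewport = page.getViewport({ scale });
    const canvas = createCanvas(n);
    canvas.width = viewport.width;
    canvas.height = viewport.height;
    // Wait for each page to finish rendering before moving on
    await page.render({ canvasContext: canvas.getContext('2d'), viewport }).promise;
    canvases.push(canvas);
  }
  return canvases;
}
```

Puppeteer could then screenshot each canvas element separately, or the whole page with fullPage: true.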

Upvotes: 2

divyanshu

Reputation: 141

Chromium does not open PDF files in headless mode; launch it with headless: false instead:

await puppeteer.launch({ args: ['--no-sandbox'], headless: false });

Upvotes: 0

Thomas Dondorf

Reputation: 25280

Headless Chrome cannot visit PDF pages and throws Error: net::ERR_ABORTED, as you are experiencing. Although you can open a PDF document with headless: false, taking a screenshot will still fail, because the PDF is not a real web page and is rendered inside a separate view.

Alternative approach

What you can do instead is download the PDF file and use PDF.js to create an image of the page. You might want to check out other information on the topic of "pdf to image" or "pdf preview". There are multiple questions on Stack Overflow regarding that topic, and also examples on the PDF.js page itself.
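
A screenshot generator that handles both web pages and PDFs needs some way to tell them apart before choosing a path. The sketch below is one way to do that, not an established recipe: looksLikePdf and chooseStrategy are hypothetical helpers, and the check assumes the "%PDF-" signature sits at the very start of the file.

```javascript
// Return true when the buffer starts with the PDF signature "%PDF-".
function looksLikePdf(buffer) {
  const signature = Buffer.from('%PDF-', 'latin1');
  return (
    buffer.length >= signature.length &&
    buffer.subarray(0, signature.length).equals(signature)
  );
}

// Decide how to produce the image: 'pdfjs' for PDFs, 'screenshot' for everything else.
// contentType is the response's Content-Type header (may be missing),
// firstBytes is the beginning of the downloaded body.
function chooseStrategy(contentType, firstBytes) {
  const type = (contentType || '').toLowerCase();
  if (type.startsWith('application/pdf') || looksLikePdf(firstBytes)) {
    return 'pdfjs';
  }
  return 'screenshot';
}
```

With this in place, only URLs that resolve to 'screenshot' would go through page.goto and page.screenshot, which avoids the net::ERR_ABORTED error for PDFs.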

Upvotes: 4
