bassman21

Reputation: 380

How to efficiently convert a large pdf with many pages into individual (high-res) jpgs with node (in the backend) using (for example) graphicsmagick?

I would like to use node (in my backend) to convert a large PDF with many (hundreds!) of pages to individual jpgs to store them in a database for further purposes.

For this I have chosen the npm package "gm" which uses "graphicsmagick" in the background.

I have encountered several big issues. For example, node seems to be unable to "digest" a large number of pages at a time. Since "gm" works asynchronously, the loop below does not wait for a page to finish but starts converting all pages almost instantly, which "freezes" my node application, i.e., it never stops working and it does not produce any pages. If I limit the number of pages to, say, 20, it works perfectly.
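I assume the freeze could be avoided by serializing the conversions, i.e. wrapping each write in a promise and awaiting it so that only one page is in flight at a time. A minimal sketch of that idea (untested for very large PDFs, and using the same "gm" API as the snippet further below):

import gm from 'gm';

// Convert a single pdf page and resolve once "gm" has written the jpg
function convertPage(buf: Buffer, page: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    gm(buf, `test.pdf[${page}]`)
      .density(150, 150)
      .quality(90)
      .write(`test${page}.jpg`, (err: any) => (err ? reject(err) : resolve()));
  });
}

async function convertAllPages(buf: Buffer, numPages: number): Promise<void> {
  for (let page = 0; page < numPages; page++) {
    // Awaiting each page keeps only one conversion in flight at a time
    await convertPage(buf, page);
  }
}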

I could not find any documentation for "gm" or "graphicsmagick" providing "best practices" for converting (large) pdfs.

The 2 most relevant questions that I have are:

a) Is there a way to tell "graphicsmagick" to produce an individual jpg file for each pdf page? "imagemagick", for example, does this "out of the box". To be more specific:

convert -density 300 test.pdf test.jpg

would produce files like "test-0.jpg", "test-1.jpg", "test-2.jpg", and so on, while

gm convert -density 300 test.pdf test.jpg

only produces one jpg file (the first page of the pdf).
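(From what I can tell from the GraphicsMagick documentation, +adjoin together with a %d pattern in the output name should force one file per page, e.g.

gm convert -density 300 test.pdf +adjoin test-%d.jpg

but I could not verify whether this is the recommended approach.)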

b) Is there a way with "gm" to reuse the same "Buffer" to produce jpg images? I assume that calling "gm" with a big buffer (of > 100MB) hundreds of times is not the best way to do it.
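Since the pages end up in a database anyway, I assume that converting each page straight to a Buffer would also be an option; a sketch using the documented toBuffer method of "gm" (the database call is a placeholder):

gm(buf, `test.pdf[${currentPage}]`)
  .density(150, 150)
  .quality(90)
  // 'JPEG' is the output format; the callback receives the encoded image
  .toBuffer('JPEG', (err: any, jpg: Buffer) => {
    if (err) console.log(err);
    // else: store "jpg" in the database here
  });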

Here is the code that I am using right now:

import gm from 'gm';
import fs from 'fs';

// Create "Buffer" to be used by "gm"
const buf = fs.readFileSync('test.pdf');

// Identify number of pages in pdf
gm(buf, 'test.pdf').identify((err: any, value: gm.ImageInfo) => {
  if (err) {
    console.log(err);
  } else {
    // "Format" contains one entry per page, so its length gives the page count
    const actualArray: string[] = value.Format.toString().split(',');
    const numPages: number = actualArray.length;
    // Loop through all pages and produce desired output
    for (let currentPage: number = 0; currentPage < numPages; currentPage++) {
      gm(buf, `test.pdf[${currentPage}]`)
        .density(150, 150)
        .quality(90)
        .write(`test${currentPage}.jpg`, (err: any) => {
          if (err) console.log(err);
        });
    }
  }
});

This approach works fine for a small number of pages, but for a large PDF it runs into the "freeze" described above.

Is there a "best practice" approach to do this right? Any hint will be highly appreciated!

Upvotes: 1

Views: 592

Answers (2)

K J

Reputation: 11730

For a commercially free, open-source task you need to avoid the tools that depend on licensed Ghostscript PDF handling in the background, such as ImageMagick, GraphicsMagick, etc.

If it's for personal use, then consider Ghostscript's sister MuTool. It's generally the fastest method; see: What is the fastest way to convert a PDF to a JPG image?

So the best FOSS workhorse for this task is Poppler, and its means of converting a PDF into image pages is pdftoppm, which has many output formats, including two types of jpg. However, I recommend considering PNG as the preferable output for documents. Any difference in file size is more than compensated for by the clarity of the pixels.

  • For OCR use PPM
  • For documents / line art use PNG
  • For photos use standard JPEG

-png : generate a PNG file
-jpeg : generate a JPEG file
-jpegcmyk : generate a CMYK JPEG file
-jpegopt : jpeg options, with format <opt1>=<val1>[,<optN>=<valN>]*

Typical Windows command line

"bin\pdftoppm.exe" -png  -r %resolution% "%filename.pdf%" "%output/rootname%"

Upvotes: 1

Alex Alex

Reputation: 2018

You can try the pdfimages utility (from the Poppler or Xpdf project) to extract the original images from the pdf.
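For example (assuming Poppler's pdfimages; -j writes JPEG files where the PDF stores an image as JPEG, everything else comes out as PPM/PBM):

pdfimages -j test.pdf out

Note that this extracts the images embedded in the PDF rather than rendering the pages, so it only helps if the pages essentially are images.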

Upvotes: 0
