Reputation: 380
I would like to use node (in my backend) to convert a large PDF with many (hundreds!) of pages to individual JPGs and store them in a database for further use.
For this I have chosen the npm package "gm" which uses "graphicsmagick" in the background.
I have encountered several big issues. For example, node seems to be unable to "digest" a large number of pages at a time. Since "gm" works asynchronously, the loop does not wait for one page to finish but starts converting all pages almost at once, which "freezes" my node application: it never stops working, and it never produces any pages. If I limit the number of pages to, say, 20, it works perfectly.
I could not find any documentation for "gm" or "graphicsmagick" providing "best practices" for converting (large) pdfs.
The two most relevant questions I have are:
a) Is there a way to tell "graphicsmagick" to produce an individual jpg file for each pdf page? "imagemagick", for example, does this "out of the box". To be more specific,
convert -density 300 test.pdf test.jpg
would produce files like "test-0.jpg", "test-1.jpg", "test-2.jpg", and so on while
gm convert -density 300 test.pdf test.jpg
only produces one jpg file (the first page of the pdf).
b) Is there a way with "gm" to reuse the same "Buffer" to produce the jpg images? I assume that calling "gm" with a big buffer (of > 100 MB) hundreds of times is not the best way to do it.
Here is the code that I am using right now:
import gm from 'gm';
import fs from 'fs';

// Create "Buffer" to be used by "gm"
const buf = fs.readFileSync('test.pdf');

// Identify the number of pages in the pdf
gm(buf, 'test.pdf').identify((err: any, value: gm.ImageInfo) => {
  if (err) {
    console.log(err);
  } else {
    // "Format" holds one entry per page, so its length equals the page count
    const actualArray: string[] = value.Format.toString().split(',');
    const numPages: number = actualArray.length;
    // Loop through all pages and produce the desired output
    for (let currentPage: number = 0; currentPage < numPages; currentPage++) {
      gm(buf, `test.pdf[${currentPage}]`)
        .density(150, 150)
        .quality(90)
        .write(`test${currentPage}.jpg`, (err: any) => {
          if (err) console.log(err);
        });
    }
  }
});
This approach works fine for a small number of pages but freezes my application for large PDFs. Is there a "best practice" approach to do this right? Any hint will be highly appreciated!
Upvotes: 1
Views: 592
Reputation: 11730
For a commercial task that must stay with free open-source software, you need to avoid tools that depend on the AGPL-licensed Ghostscript for their PDF handling in the background, such as ImageMagick, GraphicsMagick, etc.
If it's for personal use, then consider Ghostscript's sister MuTool. It's generally the fastest method; see: What is the fastest way to convert PDF to JPG image?
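A typical invocation (just a sketch, assuming mutool is on the PATH; -r sets the rendering resolution in DPI and %d is replaced by the page number):
mutool draw -r 150 -o page%d.png test.pdf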
So the best FOSS workhorse for this task is Poppler, and its tool for converting PDF pages into images is pdftoppm, which supports many output formats, including two types of JPG. However, I recommend considering PNG as the preferable output for documents: any difference in file size is more than compensated for by the clarity of the pixels. The relevant options are:
-png : generate a PNG file
-jpeg : generate a JPEG file
-jpegcmyk : generate a CMYK JPEG file
-jpegopt : jpeg options, with format <opt1>=<val1>[,<optN>=<valN>]*
A typical Windows command line:
"bin\pdftoppm.exe" -png -r %resolution% "%filename.pdf%" "%output/rootname%"
Upvotes: 1
Reputation: 2018
You can try the pdfimages utility (from the Poppler or Xpdf project) to extract the original embedded images from a PDF.
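For example (a sketch; -j writes embedded images out as JPEG files where the PDF stores them as DCT/JPEG data):
pdfimages -j test.pdf img
This produces files like img-000.jpg, img-001.jpg, and so on. Note that pdfimages extracts the images embedded in the PDF rather than rendering the pages, so it only helps if the pages are essentially scanned images.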
Upvotes: 0