Reputation: 4339
I want to write JavaScript code to extract all the images from a PDF file, ideally as JPG or another image format. There is already JavaScript code for reading a PDF file, for example in the pdf.js viewer:
window.addEventListener('change', function webViewerChange(evt) {
  var files = evt.target.files;
  if (!files || files.length === 0)
    return;
  // Read the local file into a Uint8Array.
  var fileReader = new FileReader();
  fileReader.onload = function webViewerChangeFileReaderOnload(evt) {
    var buffer = evt.target.result;
    var uint8Array = new Uint8Array(buffer);
    PDFView.open(uint8Array, 0);
  };
  var file = files[0];
  fileReader.readAsArrayBuffer(file);
  PDFView.setTitleUsingUrl(file.name);
........
Can this code be used to extract images from a PDF file?
Upvotes: 18
Views: 40049
Reputation: 3401
In case anyone else stumbles upon this and doesn't want to implement the various cases themselves, I finally found a library that handles it for me: pdf-img-convert. It uses pdf.js under the hood and returns one image per PDF page.
npm install pdf-img-convert
And use it like this:
import { writeFileSync } from "fs";
import { convert } from "pdf-img-convert";
// One image buffer per page of the PDF.
const outputImages = await convert("/path/to/pdf.pdf");
const imagePaths = outputImages.map((image, i) => {
  const path = "output" + i + ".png";
  writeFileSync(path, image);
  return path;
});
Upvotes: 0
Reputation: 19
Here is a link to a working example that gets images from a PDF and adds an alpha channel to the Uint8ClampedArray so that they can be displayed. It draws the images on a canvas.
Example in codepen: https://codepen.io/allandiego/pen/RwVGbyj
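In case the link goes away, here is a minimal sketch of the alpha-channel step. It assumes the image data is tightly packed RGB (3 bytes per pixel); `rgbToRgba` is a made-up helper name, not something from pdf.js or the codepen.

```javascript
// Hypothetical helper: expand packed RGB bytes (3 per pixel) into
// RGBA bytes (4 per pixel), making every pixel fully opaque.
function rgbToRgba(rgb, width, height) {
  const pixelCount = width * height;
  if (rgb.length !== pixelCount * 3) {
    throw new RangeError(
      'expected ' + pixelCount * 3 + ' RGB bytes, got ' + rgb.length
    );
  }
  const rgba = new Uint8ClampedArray(pixelCount * 4);
  for (let src = 0, dst = 0; src < rgb.length; src += 3, dst += 4) {
    rgba[dst] = rgb[src];         // red
    rgba[dst + 1] = rgb[src + 1]; // green
    rgba[dst + 2] = rgb[src + 2]; // blue
    rgba[dst + 3] = 255;          // alpha: opaque
  }
  return rgba;
}
```

In a browser the result can be wrapped with new ImageData(rgba, width, height) to get something putImageData accepts.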
Getting a data URL from the canvas so the image can be displayed in an img tag:
const canvas = document.createElement('canvas');
canvas.width = imageWidth;
canvas.height = imageHeight;
const ctx = canvas.getContext('2d');
ctx!.putImageData(imageData, 0, 0);
const dataURL = canvas.toDataURL();
Upvotes: 1
Reputation: 789
If you open a page with pdf.js, for example
PDFJS.getDocument({url: <pdf file>}).then(function (doc) {
  doc.getPage(1).then(function (page) {
    window.page = page;
  })
})
you can then use getOperatorList to search for paintJpegXObject objects and grab the resources.
window.objs = []
page.getOperatorList().then(function (ops) {
  for (var i=0; i < ops.fnArray.length; i++) {
    if (ops.fnArray[i] == PDFJS.OPS.paintJpegXObject) {
      window.objs.push(ops.argsArray[i][0])
    }
  }
})
Now window.objs will have a list of the resources from that page that you need to fetch.
console.log(window.objs.map(function (a) { return page.objs.get(a) }))
should print to the console a bunch of <img /> objects with data-uri src= attributes. These can be directly inserted into the page, or you can do more scripting to get at the raw data.
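For the "raw data" part, one possible approach is to decode the data URI back into bytes. This is only a sketch: it assumes the src is a base64 data URI, and `decodeDataUri` is a hypothetical helper name, not a pdf.js function.

```javascript
// Hypothetical helper: split a base64 data URI into its MIME type
// and the decoded bytes. Non-base64 data URIs are rejected.
function decodeDataUri(dataUri) {
  const match = /^data:([^;,]*)(?:;[^;,]*)*?;base64,(.*)$/.exec(dataUri);
  if (!match) {
    throw new Error('not a base64 data URI');
  }
  const binary = atob(match[2]); // one character per decoded byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return { mimeType: match[1] || 'text/plain', bytes: bytes };
}
```

Something like decodeDataUri(img.src) would then give bytes that can go into new Blob([bytes], { type: mimeType }) for downloading.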
It only works for embedded JPEG objects, but it's a start!
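To go beyond JPEG, the loop could probably be generalized to a set of opcodes. The sketch below only assumes the operator list shape used above (parallel fnArray and argsArray); `collectImageNames` is a name I made up, and which opcodes to pass depends on your pdf.js version.

```javascript
// Hypothetical helper: return the object name (first argument) of every
// operator whose opcode is in `imageOps`, without duplicates.
// `ops` is expected to look like { fnArray: [...], argsArray: [...] }.
function collectImageNames(ops, imageOps) {
  const wanted = new Set(imageOps);
  const names = [];
  for (let i = 0; i < ops.fnArray.length; i++) {
    if (!wanted.has(ops.fnArray[i])) {
      continue;
    }
    const name = ops.argsArray[i][0];
    if (!names.includes(name)) {
      names.push(name); // the same image can be painted more than once
    }
  }
  return names;
}
```

For example, collectImageNames(ops, [PDFJS.OPS.paintJpegXObject, PDFJS.OPS.paintImageXObject]) should also pick up non-JPEG images. For those I would not expect page.objs.get to return an <img />, so treat the result as something to inspect rather than insert directly.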
Upvotes: 26