Mika H.

Reputation: 4339

Extract images from PDF file with JavaScript

I want to write JavaScript code that extracts all images from a PDF file, ideally as JPG or some other common image format. There is already JavaScript code for reading PDF files, for example in the pdf.js viewer:

window.addEventListener('change', function webViewerChange(evt) {
  var files = evt.target.files;
  if (!files || files.length === 0)
    return;

  // Read the local file into a Uint8Array.
  var fileReader = new FileReader();
  fileReader.onload = function webViewerChangeFileReaderOnload(evt) {
    var buffer = evt.target.result;
    var uint8Array = new Uint8Array(buffer);
    PDFView.open(uint8Array, 0);
  };

  var file = files[0];
  fileReader.readAsArrayBuffer(file);
  PDFView.setTitleUsingUrl(file.name);
  ........

Can this code be used to extract images from a PDF file?

Upvotes: 18

Views: 40049

Answers (3)

Marek Lisý

Reputation: 3401

In case anyone else stumbles upon this and doesn't want to implement the various cases themselves: I finally found a library that does everything for me, pdf-img-convert. It uses pdf.js under the hood.

npm install pdf-img-convert

And use like this:

import { writeFileSync } from "fs";
import { convert } from "pdf-img-convert";

const outputImages = await convert("/path/to/pdf.pdf");
// Write each returned image to disk as output0.png, output1.png, ...
const imagePaths = outputImages.map((image, i) => {
  const path = "output" + i + ".png";
  writeFileSync(path, image);
  return path;
});

Upvotes: 0

kubanm3

Reputation: 19

Here is a link to a working example that gets the images from a PDF and adds an alpha channel to the Uint8ClampedArray so that they can be displayed. It draws the images on a canvas.

Example in codepen: https://codepen.io/allandiego/pen/RwVGbyj

To get a data URL from the canvas so the image can be shown in an img tag:

const canvas = document.createElement('canvas');
canvas.width = imageWidth;
canvas.height = imageHeight;
const ctx = canvas.getContext('2d');
ctx!.putImageData(imageData, 0, 0);
const dataURL = canvas.toDataURL();
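
The alpha-channel step mentioned above could look roughly like the sketch below. It is not taken from the codepen; `rgbToRgba` is a hypothetical helper name, and it assumes the image data arrives as tightly packed RGB bytes (3 per pixel, no row padding), which may not hold for every image in every PDF.

```javascript
// Expand tightly packed RGB bytes (3 per pixel) into RGBA (4 per pixel).
// Assumption: `rgb` holds exactly width * height * 3 bytes, no row padding.
function rgbToRgba(rgb, width, height) {
  const pixelCount = width * height;
  if (rgb.length !== pixelCount * 3) {
    throw new Error('Expected ' + pixelCount * 3 + ' bytes, got ' + rgb.length);
  }
  const rgba = new Uint8ClampedArray(pixelCount * 4);
  for (let src = 0, dst = 0; src < rgb.length; src += 3, dst += 4) {
    rgba[dst] = rgb[src];         // red
    rgba[dst + 1] = rgb[src + 1]; // green
    rgba[dst + 2] = rgb[src + 2]; // blue
    rgba[dst + 3] = 255;          // alpha: fully opaque
  }
  return rgba;
}
```

In a browser, the result can be wrapped with `new ImageData(rgbToRgba(data, imageWidth, imageHeight), imageWidth, imageHeight)` and passed to `putImageData`.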

Upvotes: 1

Jason Siefken

Reputation: 789

If you open a page with pdf.js, for example

PDFJS.getDocument({url: <pdf file>}).then(function (doc) {
    doc.getPage(1).then(function (page) {
        window.page = page;
    })
})

you can then use getOperatorList to search for paintJpegXObject objects and grab the resources.

window.objs = []
page.getOperatorList().then(function (ops) {
    for (var i=0; i < ops.fnArray.length; i++) {
        if (ops.fnArray[i] == PDFJS.OPS.paintJpegXObject) {
            window.objs.push(ops.argsArray[i][0])
        }
    }
})

Now window.objs will have a list of the resources from that page that you need to fetch.

console.log(window.objs.map(function (a) { return page.objs.get(a); }))

should print to the console a bunch of <img /> objects with data-uri src= attributes. These can be directly inserted into the page, or you can do more scripting to get at the raw data.
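
For the "raw data" route, here is a minimal sketch of turning one of those src values back into bytes. `dataUriToBytes` is a hypothetical helper, and it assumes the src really is a base64 data URI as described above; for any other kind of URL it simply throws. It relies on `atob`, which browsers provide.

```javascript
// Decode a base64 data URI (such as an <img> src) into a MIME type and raw bytes.
function dataUriToBytes(dataUri) {
  const match = /^data:([^;,]+);base64,([\s\S]*)$/.exec(dataUri);
  if (!match) {
    throw new Error('Not a base64 data URI');
  }
  const binary = atob(match[2]); // binary string: one character per byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return { mimeType: match[1], bytes: bytes };
}
```

The bytes can then be wrapped in a `Blob` for download, for example `new Blob([result.bytes], { type: result.mimeType })`.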

It only works for embedded JPEG objects, but it's a start!
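
One way to go beyond JPEGs is to make the scan take a list of operator codes instead of hard-coding one. The sketch below is only that: `collectImageNames` is a hypothetical helper, and which codes exist (for example `PDFJS.OPS.paintImageXObject` next to `PDFJS.OPS.paintJpegXObject`) depends on your pdf.js version, so check what your build exposes.

```javascript
// Collect the object names referenced by image-painting operators.
// `ops` has the shape returned by getOperatorList(): parallel arrays
// fnArray (operator codes) and argsArray (arguments per operator).
// `imageOpCodes` is chosen by the caller; it is an assumption about
// which operators paint images in the pdf.js version in use.
function collectImageNames(ops, imageOpCodes) {
  const wanted = new Set(imageOpCodes);
  const names = [];
  for (let i = 0; i < ops.fnArray.length; i++) {
    if (!wanted.has(ops.fnArray[i])) continue;
    const name = ops.argsArray[i][0];
    // The same image can be painted more than once; keep each name once.
    if (!names.includes(name)) names.push(name);
  }
  return names;
}
```

Usage would be something like `collectImageNames(ops, [PDFJS.OPS.paintJpegXObject, PDFJS.OPS.paintImageXObject])`. Objects painted by non-JPEG operators may come back from `page.objs.get` in a different form (raw pixel data rather than an `<img>` element), so inspect what you get before relying on it.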

Upvotes: 26
