PDFJS and PDF encoding

Question

We are implementing PDFJS to render pdf files on a website.

When trying to initiate a PDFdocument/Viewer from an ArrayBuffer, we get all sorts of errors and the file is not rendered. When opening the same file in the viewer from a URL (the DEFAULT_URL variable), the file renders fine.

There are however some files that do render as streams. Comparing these files in notepad shows they have different encoding/characters.

This piece of code is used to open the file in the viewer:

function rawStringToBuffer( str ) {
    var idx, len = str.length, arr = new Array( len );
    for ( idx = 0 ; idx < len ; ++idx ) {
        arr[ idx ] = str.charCodeAt(idx) & 0xFF;
    }
    return new Uint8Array( arr ).buffer;
}

function readSingleFile(e) {
  var file = e.target.files[0];
  if (!file) {
    return;
  }
  var reader = new FileReader();
  reader.onload = function(e) {
    var contents = e.target.result;

    var uint8array = rawStringToBuffer(contents);

    pdfjsframe.contentWindow.PDFViewerApplication.open(uint8array,0);

  };
  reader.readAsText(file);
}

test.pdf: a hello-world PDF that is not rendered with the code above.

test2.pdf: a hello-world PDF that does render with the code above.

The behaviour is not browser dependent. The build is b15f335.

Is there something wrong with the code or the default configuration of the viewer that prevents test.pdf from being rendered?

rhashimoto · Accepted Answer

I don't think your string conversion routine rawStringToBuffer() does what you want. Reading the file with readAsText() decodes its bytes as UTF-8 (the default) into a JavaScript string of UTF-16 code units. rawStringToBuffer() then keeps only the low-order byte of each code unit and discards the high-order byte, which is not the inverse transform. This works for 7-bit ASCII data but not for other characters, and a PDF usually contains binary data outside that range. The best way to convert a string to UTF-8 is with the TextEncoder API (not supported on all browsers, but polyfills are available).
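To make the lossy round trip concrete, here is a minimal sketch that can be run in Node or a browser console. It uses TextDecoder to approximate what readAsText() does by default, and the sample bytes are made up for illustration (they are not taken from test.pdf). The helper names lowBytes and sameBytes are hypothetical.

```javascript
// Illustrative bytes only: "%PDF" followed by four non-ASCII bytes,
// similar to the binary marker many PDFs carry near the top of the file.
const original = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0xE2, 0xE3, 0xCF, 0xD3]);

// Approximately what readAsText() does by default: decode the bytes as UTF-8.
const text = new TextDecoder("utf-8").decode(original);

// Same logic as rawStringToBuffer() in the question:
// keep only the low-order byte of each UTF-16 code unit.
function lowBytes(str) {
  const out = new Uint8Array(str.length);
  for (let i = 0; i < str.length; ++i) {
    out[i] = str.charCodeAt(i) & 0xFF;
  }
  return out;
}

const roundTripped = lowBytes(text);

// Hypothetical helper to compare two byte arrays element by element.
function sameBytes(a, b) {
  return a.length === b.length && a.every((v, i) => v === b[i]);
}

// The ASCII prefix survives the round trip.
console.log(sameBytes(original.subarray(0, 4), roundTripped.subarray(0, 4)));
// The full array, including the non-ASCII bytes, does not.
console.log(sameBytes(original, roundTripped));
```

Re-encoding the string with TextEncoder would not recover these bytes either, because each invalid byte was already replaced with U+FFFD during decoding.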

However, converting the data from UTF-8 and back again is unnecessary. Just use FileReader.readAsArrayBuffer() instead of readAsText() to produce your ArrayBuffer directly.

Here's an (untested) replacement function:

function readSingleFile(e) {
  var file = e.target.files[0];
  if (!file) {
    return;
  }
  var reader = new FileReader();
  reader.onload = function(e) {
    var contents = e.target.result;

    pdfjsframe.contentWindow.PDFViewerApplication.open(contents, 0);
  };
  reader.readAsArrayBuffer(file);
}
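In environments that support Blob.arrayBuffer(), the same idea can be written without FileReader. This is only a sketch: readFileBytes is an illustrative name, not part of PDF.js, and the in-memory Blob stands in for e.target.files[0]. The resulting bytes would then be handed to the viewer in the same way as in the function above.

```javascript
// Hypothetical helper: read a File/Blob into a Uint8Array
// without any text decoding, so every byte is preserved.
async function readFileBytes(file) {
  const buffer = await file.arrayBuffer(); // Blob.prototype.arrayBuffer()
  return new Uint8Array(buffer);
}

// Usage sketch with made-up bytes standing in for a user-selected file.
const sample = new Blob([
  new Uint8Array([0x25, 0x50, 0x44, 0x46, 0xE2, 0xE3, 0xCF, 0xD3])
]);

const bytesPromise = readFileBytes(sample);
bytesPromise.then(function (bytes) {
  // The non-ASCII bytes arrive unchanged.
  console.log(bytes.length, bytes[4].toString(16));
});
```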
