Reputation: 2897
I am trying to extract plain text from a PDF document using pdf.js,
but for some reason I cannot get past the "Invalid PDF structure"
error.
My code is as follows:
const pdfjslib = require('pdfjs-dist');
const pdfPath = 'https://www.corenet.gov.sg/media/2268607/dc19-07.pdf';
const loadingTask = pdfjslib.getDocument(pdfPath);
loadingTask.promise
  .then(async (doc) => {
    console.log(doc);
    return null;
  })
  .catch((err) => {
    console.log(err);
  });
I have tried other PDF documents from the same domain, but they all throw the same error:
...
Warning: Ignoring invalid character "34" in hex string
Warning: Ignoring invalid character "104" in hex string
Warning: Indexing all PDF objects
{ Error
at InvalidPDFExceptionClosure (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:658:35)
at Object.<anonymous> (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:661:2)
at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
at Object.defineProperty.value (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:129:23)
at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
at pdfjsVersion (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:116:18)
at .../pdf_test/node_modules/pdfjs-dist/build/pdf.js:119:10
at webpackUniversalModuleDefinition (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:25:20)
at Object.<anonymous> (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:32:3)
at Module._compile (internal/modules/cjs/loader.js:776:30)
name: 'InvalidPDFException',
message: 'Invalid PDF structure' }
PDFs from other domains seem to work. Note that downloading the PDF from the above domain works fine, and the file can be viewed in the Chrome browser, so I doubt that the document is corrupted. I am not implementing any front-end code, as the intention is to host the above code in the cloud.
Upvotes: 4
Views: 27970
Reputation: 11857
I have seen this type of poor file structure before. Whilst most readers will initially accept such files as being full of deletions, they cause problems later.
The Info metadata has been culled along with many other redactions, but the index has not been rebuilt cleanly. Such files may cause distorted behaviour, such as missing annotations and other processing oddities.
One important indexed entry (the one for pages) is declared as active and then later declared as deleted, so the number of pages is technically invalidated. However, as @mkl commented, the combined methods are valid and will usually pass the accessibility standards, EXCEPT that the title had been removed along with the metadata. Once the "Title" is added as part of the manual checks, the file is fully up to scratch, although that process will naturally remove redundant entries and add others, so the file size is reduced.
endobj
startxref
27552
%%EOF
Summary
The checker found no problems in this document.
Needs manual check: 0
Passed manually: 2
Failed manually: 0
Skipped: 1
Passed: 29
Failed: 0
Looking at the source file, position 125 holds the list of pages:
2 0 obj <</Type/Pages/Count 2/Kids[ 3 0 R 17 0 R] >> endobj
However, the whole table is then encoded yet again:
trailer
<</Size 156/Root 1 0 R/Info 148 0 R/ID[<F8D1F9DF5B960D47837AA3719AED9B81><F8D1F9DF5B960D47837AA3719AED9B81>] >>
startxref
27702
%%EOF
xref
0 0
trailer
<</Size 156/Root 1 0 R/Info 148 0 R/ID[<F8D1F9DF5B960D47837AA3719AED9B81><F8D1F9DF5B960D47837AA3719AED9B81>] /Prev 27702/XRefStm 27153>>
startxref
30982
%%EOF
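The stacked trailers quoted above can also be spotted programmatically. The following is only a minimal sketch, assuming the PDF bytes are already in a Node Buffer; `findStartxrefOffsets` is a hypothetical helper written for illustration and is not part of pdf.js. It lists every `startxref` offset that is followed by `%%EOF`, so more than one result suggests the file carries appended index sections.

```javascript
// Hypothetical helper: list every `startxref` offset found in a PDF buffer.
// Each appended section ends with its own startxref / %%EOF pair, so more
// than one result suggests the index was written more than once.
function findStartxrefOffsets(pdfBuffer) {
  // latin1 maps each byte to exactly one character, which is safe for binary data.
  const text = pdfBuffer.toString('latin1');
  const pattern = /startxref\s+(\d+)\s+%%EOF/g;
  const offsets = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    offsets.push(Number(match[1]));
  }
  return offsets;
}

// Sample assembled from the trailer fragments quoted above (not the full file).
const sample = Buffer.from(
  'startxref\n27702\n%%EOF\nxref\n0 0\ntrailer\n<< /Prev 27702 >>\nstartxref\n30982\n%%EOF\n',
  'latin1'
);
console.log(findStartxrefOffsets(sample)); // [ 27702, 30982 ]
```

To try it on a real file, read the file into a Buffer first (for example with `fs.readFileSync`) and pass that in.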
The best thing to do is run the file through a cleaning process to rationalise the working index, even if that means a percentage increase in file size.
Beware: avoid using Ghostscript to rebuild the file (without any redundant objects), as it reduces the trailer /Size count drastically, from 156 down to 33. The cleaned, optimised file is therefore much smaller, but it has lost all accessibility data.
Avoid basic optimisation such as:
gs -sDEVICE=pdfwrite -oTestIt.pdf dc19-07.pdf
0000012448 00000 n
0000017006 00000 n
0000020240 00000 n
0000007646 00000 n
0000019954 00000 n
0000022761 00000 n
trailer
<< /Size 33 /Root 1 0 R /Info 2 0 R
/ID [<3488BED994E047F0C0FAB9AB78CA5FF8><3488BED994E047F0C0FAB9AB78CA5FF8>]
>>
startxref
24137
%%EOF
Mutool cleaning is more likely to keep the desired source data:
mutool clean -m -s -f -i -gggg dc19-07.pdf clean-DC19-07.pdf
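Since the question runs under Node, the same cleaning step could be driven from a script before the file is handed to pdf.js. This is a rough sketch only: it assumes `mutool` is installed and on the PATH, the flags are copied verbatim from the command above, and `mutoolCleanArgs` and `cleanPdf` are hypothetical names used for illustration.

```javascript
const { execFile } = require('child_process');

// Build the argument list; flags are copied verbatim from the command above.
function mutoolCleanArgs(inputPath, outputPath) {
  return ['clean', '-m', '-s', '-f', '-i', '-gggg', inputPath, outputPath];
}

// Hypothetical wrapper: resolves with the output path once mutool exits
// without error, and rejects otherwise (including when mutool is missing).
function cleanPdf(inputPath, outputPath) {
  return new Promise((resolve, reject) => {
    execFile('mutool', mutoolCleanArgs(inputPath, outputPath), (err) => {
      if (err) reject(err);
      else resolve(outputPath);
    });
  });
}
```

A call such as `cleanPdf('dc19-07.pdf', 'clean-DC19-07.pdf')` would then return a promise that settles when the cleaned copy has been written.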
Upvotes: 1
Reputation: 2733
The errors in the browser console log did not help me fix this.
I run a PHP app (Moodle), and in the PHP error log I saw entries about variables that were expected to be replaced within the HTML source body of the certificate being generated.
Check your backend app's error logs, and check the HTML source body provided to PDF.js for missing or undefined variables.
Rebuilding the HTML body provided to PDF.js from scratch helps to pinpoint the source of the exception.
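One cheap way to catch this case early is to check whether the backend returned PDF bytes at all before passing them on. The sketch below is an illustration under stated assumptions: `looksLikePdf` is a hypothetical helper, and the 1024-byte window is an assumption based on the leniency many readers show about where the `%PDF-` header appears.

```javascript
// Hypothetical pre-check: does this payload look like a PDF at all?
// A backend that failed may return an HTML error page or a half-rendered
// template instead, which a PDF parser will reject as an invalid structure.
function looksLikePdf(payload) {
  // Accepts a Buffer or a string; only the first 1024 bytes are inspected.
  const head = Buffer.from(payload).subarray(0, 1024).toString('latin1');
  return head.includes('%PDF-');
}

console.log(looksLikePdf('%PDF-1.7\n1 0 obj'));                  // true
console.log(looksLikePdf('<html><body>Notice: ...</body></html>')); // false
```

When the check fails, logging the first few hundred characters of the payload usually shows what the backend sent instead.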
Upvotes: 0