Reputation: 263
I'm trying to load a PDF file so I can extract it as an image. I've tried a couple of Python packages, e.g. PyPDF2, but each time I get the error "Could not find xref table at specified location".
I don't have any experience with PDFs and Python, so any tips would be appreciated. An example file is given here:
https://beta.companieshouse.gov.uk/company/00002404/filing-history
where the PDF is the 'full accounts' link.
Many thanks in advance!
Upvotes: 5
Views: 12328
Reputation: 684
As mentioned by gettalong, you can use qpdf to fix a corrupted PDF. Nowadays you can also simply use pikepdf instead of PyPDF2. Because pikepdf is built on qpdf, it copes well with corrupted PDFs.
Example:
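import pikepdf
pdf = pikepdf.Pdf.open("broken.pdf")  # qpdf tries to recover a damaged xref table on open
pdf.save("repaired.pdf")  # write a repaired copy under a new name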
Pikepdf docs: https://pikepdf.readthedocs.io/en/latest/
Upvotes: 3
Reputation: 830
You can use QPDF for this since it has a faulty xref table recovery method.
Just run
qpdf broken.pdf repaired.pdf
where broken.pdf is the broken input PDF and repaired.pdf is the new file name.
I tried it with the PDF you linked to and it worked fine.
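If you would rather trigger the repair from a Python script, the same command can be run through subprocess. This is only a sketch: the helper names build_qpdf_command and repair_pdf are made up for illustration, and it assumes the qpdf executable is on your PATH. qpdf documents exit status 3 as "warnings but no errors", which is what a successfully recovered file typically produces, so the sketch treats both 0 and 3 as success.

```python
import subprocess


def build_qpdf_command(src, dst):
    """Return the argument list for 'qpdf <src> <dst>' (hypothetical helper)."""
    return ["qpdf", src, dst]


def repair_pdf(src, dst):
    """Run qpdf to write a repaired copy of src to dst (hypothetical helper).

    Assumes the qpdf executable is installed and on PATH.
    """
    result = subprocess.run(
        build_qpdf_command(src, dst),
        capture_output=True,
        text=True,
    )
    # Exit status 0: clean run. Exit status 3: finished with warnings only,
    # which is expected when qpdf had to rebuild a damaged xref table.
    # Exit status 2 means qpdf hit errors it could not recover from.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed: {result.stderr.strip()}")
    return result.returncode
```

Calling repair_pdf("broken.pdf", "repaired.pdf") should then leave a file that PyPDF2 and similar libraries can open.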
Upvotes: 7
Reputation: 95918
The PDF in question is broken: The offset of the cross reference table and most object offsets in it are completely wrong.
E.g. the PDF claims that the cross reference table starts at file position 24732 but it actually starts at position 1594356. And the cross reference table entry for object 208 claims it to be at position 24713 while it actually is at 1594337.
Thus the observed error message "Could not find xref table at specified location" is completely correct.
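You can check this kind of mismatch yourself with a few lines of Python. The sketch below is a rough diagnostic, not a PDF parser: the function name check_startxref is made up, it assumes a classic cross reference table (not a cross reference stream), and a byte sequence that happens to spell "xref" inside a binary stream could fool it.

```python
import re


def check_startxref(data: bytes):
    """Return (declared_offset, actual_offset) for the last xref table.

    declared_offset is the number after the last 'startxref' keyword;
    actual_offset is where the last standalone 'xref' keyword really
    begins, or None if no such keyword is found.
    """
    declared_match = None
    for declared_match in re.finditer(rb"startxref\s+(\d+)", data):
        pass  # keep only the last match
    if declared_match is None:
        raise ValueError("no startxref keyword found")
    declared = int(declared_match.group(1))

    # The lookbehind skips the 'xref' that is part of 'startxref'.
    positions = [m.start() for m in re.finditer(rb"(?<![a-z])xref\b", data)]
    actual = positions[-1] if positions else None
    return declared, actual
```

Reading the file with open(path, "rb").read() and passing the bytes in gives you the two numbers to compare; if they differ, the file has the problem described above.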
The first offsets in the table are correct, though, at first glance up to the first image stream.
It appears as if the software producing the PDF did not count image stream contents when determining object offsets. Or it took a template with very small placeholder images and replaced the image streams of these small images by much larger streams without updating cross reference offsets.
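To see from which object onwards the table entries go wrong, you can compare each entry with the position where the object really starts. Again this is only a sketch under simplifying assumptions: find_wrong_offsets is a made-up name, it handles a single classic xref table with one subsection, it expects "N G obj" markers at the start of a line terminated by a newline, and it may be misled by binary stream data that looks like an object header.

```python
import re


def find_wrong_offsets(data: bytes):
    """Return {object_number: (declared, actual)} for mismatching entries."""
    # Actual positions: every 'N G obj' marker at the start of a line.
    actual = {}
    for m in re.finditer(rb"^(\d+) (\d+) obj\b", data, flags=re.MULTILINE):
        actual[int(m.group(1))] = m.start()

    # Declared positions: the entries of the first classic xref table found.
    table = re.search(
        rb"(?<![a-z])xref\s+(\d+) (\d+)\s+((?:\d{10} \d{5} [nf]\s+)+)", data
    )
    if table is None:
        raise ValueError("no classic xref table found")
    first = int(table.group(1))
    entries = re.findall(rb"(\d{10}) \d{5} ([nf])", table.group(3))

    wrong = {}
    for index, (offset, kind) in enumerate(entries):
        number = first + index
        # Only in-use entries ('n') carry a byte offset worth checking.
        if kind == b"n" and number in actual and actual[number] != int(offset):
            wrong[number] = (int(offset), actual[number])
    return wrong
```

If the guess above is right, the reported differences between declared and actual offsets should grow each time an image stream is passed.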
Upvotes: 5