Mike Miller

Reputation: 263

Could not find x-ref table PDF

I'm trying to load a PDF file so I can extract its pages as images. I've tried a couple of Python packages, e.g. PyPDF2, but each time I get the error "Could not find xref table at specified location".

I don't have any experience with PDFs and Python, so any tips would be appreciated. An example file is given here:

https://beta.companieshouse.gov.uk/company/00002404/filing-history

where the PDF is the 'full accounts' link.

Many thanks in advance!

Upvotes: 5

Views: 12328

Answers (3)

Wesley - Synio

Reputation: 684

As mentioned by gettalong, you could use qpdf to fix a corrupted PDF. Nowadays you can also simply use pikepdf instead of PyPDF2. Because that library is based on qpdf, it copes well with corrupted PDFs.

Example:

import pikepdf
# "broken.pdf" is a placeholder; use the path to your own file
pdf = pikepdf.Pdf.open("broken.pdf")

Pikepdf docs: https://pikepdf.readthedocs.io/en/latest/

Upvotes: 3

gettalong

Reputation: 830

You can use QPDF for this, since it can recover from a faulty xref table.

Just run qpdf broken.pdf repaired.pdf where broken.pdf is the broken input PDF and repaired.pdf is the new file name.
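If you want to do the same from a Python script, here is a minimal sketch that simply shells out to the qpdf command above. It assumes qpdf is installed and on your PATH. The function names build_qpdf_command and repair_pdf are made up for this example, and the file names are placeholders.

```python
import shutil
import subprocess


def build_qpdf_command(broken_path, repaired_path):
    """Return the argument list for `qpdf broken.pdf repaired.pdf`."""
    return ["qpdf", broken_path, repaired_path]


def repair_pdf(broken_path, repaired_path):
    """Run qpdf and return (ok, stderr_text)."""
    if shutil.which("qpdf") is None:
        raise FileNotFoundError("qpdf executable not found on PATH")
    result = subprocess.run(
        build_qpdf_command(broken_path, repaired_path),
        capture_output=True,
        text=True,
    )
    # qpdf exits with 0 on success, 3 when it succeeded but printed warnings
    # (which is what you would expect for a rebuilt xref table), 2 on errors.
    return result.returncode in (0, 3), result.stderr


if __name__ == "__main__":
    ok, messages = repair_pdf("broken.pdf", "repaired.pdf")
    print("repaired" if ok else "failed")
    print(messages)
```

The warnings qpdf prints on stderr are worth reading, since they usually say what it had to reconstruct.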

I tried it with the PDF you linked to and it worked fine.

Upvotes: 7

mkl

Reputation: 95918

The PDF in question is broken: the offset of the cross-reference table and most of the object offsets in it are completely wrong.

E.g. the PDF claims that the cross reference table starts at file position 24732 but it actually starts at position 1594356. And the cross reference table entry for object 208 claims it to be at position 24713 while it actually is at 1594337.
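You can check this kind of mismatch yourself with a few lines of Python. The following is only a rough sketch: check_startxref is a name I made up, it only handles classic cross reference tables (not cross reference streams), and it assumes the bytes "xref" do not happen to occur inside binary stream data after the real table.

```python
import re


def check_startxref(data):
    """Return (claimed_offset, actual_offset, matches) for a PDF given as bytes."""
    # Use the last 'startxref', which belongs to the final trailer.
    claims = list(re.finditer(rb"startxref\s+(\d+)", data))
    if not claims:
        raise ValueError("no startxref keyword found")
    claimed = int(claims[-1].group(1))

    # Position of the last 'xref' keyword that is not part of 'startxref'.
    actual = -1
    for match in re.finditer(rb"(?<!start)xref", data):
        actual = match.start()

    # A correct file has the 'xref' keyword exactly at the claimed offset.
    return claimed, actual, data[claimed:claimed + 4] == b"xref"


if __name__ == "__main__":
    with open("broken.pdf", "rb") as handle:  # placeholder file name
        print(check_startxref(handle.read()))
```

For a healthy file the first two numbers are equal and the third value is True.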

Thus the observed error message "Could not find xref table at specified location" is completely correct.

The first offsets in the table are correct, though; at first glance, everything up to the first image stream is.
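To find where the table stops being correct, you can walk its entries and check whether each in-use offset really points at the matching "N G obj" header. Again just a sketch under assumptions: first_bad_xref_entry is a hypothetical helper, xref_pos must be the real position of the table (not the wrong one from startxref), and it expects object headers written with single spaces, which is the usual form.

```python
import re


def first_bad_xref_entry(data, xref_pos):
    """Return the first object number whose xref offset is wrong, or None."""
    lines = data[xref_pos:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("no xref keyword at the given position")
    i = 1
    while i < len(lines):
        # Each subsection starts with "<first object number> <entry count>".
        header = re.fullmatch(rb"(\d+) (\d+)\s*", lines[i])
        if header is None:
            break  # reached 'trailer' (or something unexpected)
        first, count = int(header.group(1)), int(header.group(2))
        for k, line in enumerate(lines[i + 1:i + 1 + count]):
            entry = re.match(rb"(\d{10}) (\d{5}) ([nf])", line)
            if entry is None:
                return first + k  # malformed entry
            offset, gen = int(entry.group(1)), int(entry.group(2))
            if entry.group(3) == b"n":  # 'f' marks free entries, skip those
                expected = b"%d %d obj" % (first + k, gen)
                if data[offset:offset + len(expected)] != expected:
                    return first + k
        i += 1 + count
    return None
```

Comparing the first bad object number with the position of the first image stream is a quick way to test the observation above.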

It appears as if the software producing the PDF did not count image stream contents when determining object offsets. Or it took a template with very small placeholder images and replaced the image streams of these small images by much larger streams without updating cross reference offsets.

Upvotes: 5
