PatentDeathSquad

Reputation: 553

Python, pyPdf OCR error: pyPdf.utils.PdfReadError: EOF marker not found

pyPdf throws this exception:

pyPdf.utils.PdfReadError: EOF marker not found

I don't need to fix pyPdf. I just need the EOF error to trigger an "except" block so the file gets skipped, but that doesn't work: the error still stops the program.

Background:

Batch OCR Program for PDFs

Python, pyPdf, Adobe PDF OCR error: unsupported filter /lzwdecode

... the saga continues.

I've got 10,000 PDFs in a folder. Some are OCR'd, some aren't, and I can't tell them apart. Step 1 is to figure out which ones are not OCR'd and OCR only those (see the other threads for details).

So I'm using pyPdf. I get some exceptions related to unrecognized characters and unsupported filters when I try to read the text. So I guesstimated that if a file throws an exception, it's got some text in it, and it doesn't go in the list. Problem solved, right? Like so:

    from pyPdf import PdfFileWriter, PdfFileReader
    import sys, os, pyPdf, re

    path = 'C:\Users\Homer\Documents\My Pdfs'

    filelist = os.listdir(path)

    has_text_list = []
    does_not_have_text_list = []

    for pdf_name in filelist:
        pdf_file_with_directory = os.path.join(path, pdf_name)
        pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))
        print pdf_name
        for i in range(0, pdf.getNumPages()):
            try:
                pdf.write("%%EOF")
                content = pdf.getPage(i).extractText()
                does_it_have_text = re.findall(r'\w{2,}', content)
                if does_it_have_text == []:
                    does_not_have_text_list.append(pdf_name)
                    print pdf_name
                else:
                    has_text_list.append(pdf_name)
            except:
                has_text_list.append(pdf_name)

    print does_not_have_text_list

But then I get this error:

pyPdf.utils.PdfReadError: EOF marker not found

Seems like it comes up a lot (from Google):

http://pdfposter.origo.ethz.ch/node/31

I think it means that pyPdf opened the file, made its attempt at text processing, raised whatever exception, ran the except: block, but is now unable to go on to the next step because it doesn't know that the file has ended.

There are other threads like this and they allege that this has been fixed, but it doesn't seem to have been.

Then someone has a function here where they write the EOF character to the .pdf first.

http://code.activestate.com/lists/python-list/589529/

I stuck in the pdf.write("%%EOF") line to try to mimic this, but no dice.

So how do I get that error to run the except block? I'm also using Wing IDE, so if there's a way to use the debugger to just skip over these files, that would work too. Thx.

Upvotes: 2

Views: 5839

Answers (1)

jcomeau_ictx

Reputation: 38492

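Put your pyPdf call(s) inside the try/except block as well. In your loop, the pyPdf.PdfFileReader(...) constructor and pdf.getNumPages() both run before the try: starts, so an exception raised by either of them is never caught and stops the program.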
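A minimal sketch of what that could look like, not a drop-in replacement. The function name find_pdfs_without_text, the make_reader parameter and the separate skipped list are my own, hypothetical choices: the reader is passed in as a callable so the loop can be tried without pyPdf installed, and unreadable files go into their own list instead of has_text_list. It also records each file once rather than once per page. I left out the pdf.write("%%EOF") line; as far as I know PdfFileReader is a reader only and has no write method, so that call would itself raise inside the try.

```python
import os
import re


def find_pdfs_without_text(folder, make_reader):
    """Sort the files in `folder` into (no_text, has_text, skipped) name lists.

    `make_reader` is any callable that takes an open binary file and returns
    an object with getNumPages() and getPage(i).extractText(), for example
    pyPdf.PdfFileReader (assumption: passed in by the caller).
    """
    no_text, has_text, skipped = [], [], []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        try:
            with open(full_path, 'rb') as handle:
                # The reader is created inside the try block, so an error
                # raised while opening or parsing the file is caught as well.
                reader = make_reader(handle)
                found_text = False
                for i in range(reader.getNumPages()):
                    content = reader.getPage(i).extractText()
                    if re.search(r'\w{2,}', content):
                        found_text = True
                        break
        except Exception:
            # Anything that cannot be read is skipped instead of being fatal.
            skipped.append(name)
            continue
        if found_text:
            has_text.append(name)
        else:
            no_text.append(name)
    return no_text, has_text, skipped
```

With pyPdf installed, the call would presumably be find_pdfs_without_text(path, pyPdf.PdfFileReader), and the first returned list is the set of files that still need OCR. Catching Exception rather than using a bare except: keeps Ctrl-C working during a long batch run.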

Upvotes: 2
