pyPdf throws this exception:
pyPdf.utils.PdfReadError: EOF marker not found
I don't need to fix pyPdf, I just need to get the EOF error to cause an "except" block to execute and skip over the file, but it doesn't work. It still causes the program to stop running.
Background:
Python, pyPdf, Adobe PDF OCR error: unsupported filter /lzwdecode
... the saga continues.
I got 10,000 pdfs in a folder. Some OCRd, some not. Can't tell 'em apart. Step 1 was to figure out which ones are not OCRd and OCR only those (see other threads for details).
So I'm using pyPdf. I get some exceptions related to unrecognized characters and unsupported filters when I try to read the text. So I guesstimated that if a file throws an exception, it's got some text in it, and then it doesn't go in the list. Problem solved, right? Like so:
from pyPdf import PdfFileWriter, PdfFileReader
import sys, os, pyPdf, re
path = 'C:\Users\Homer\Documents\My Pdfs'
filelist = os.listdir(path)
has_text_list = []
does_not_have_text_list = []
for pdf_name in filelist:
    pdf_file_with_directory = os.path.join(path, pdf_name)
    pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))
    print pdf_name
    for i in range(0, pdf.getNumPages()):
        try:
            pdf.write("%%EOF")
            content = pdf.getPage(i).extractText()
            does_it_have_text = re.findall(r'\w{2,}', content)
            if does_it_have_text == []:
                does_not_have_text_list.append(pdf_name)
                print pdf_name
            else:
                has_text_list.append(pdf_name)
        except:
            has_text_list.append(pdf_name)
print does_not_have_text_list
But then I get this error:
pyPdf.utils.PdfReadError: EOF marker not found
Seems like it comes up a lot (from google):
http://pdfposter.origo.ethz.ch/node/31
I think it means that pyPdf opened the file, made its attempt at text processing, raised whatever exception, ran the except: block, but is now unable to go to the next step because it doesn't know that the file has ended.
There are other threads like this and they allege that this has been fixed, but it doesn't seem to have been.
Then someone has a function here where they write the EOF marker to the .pdf first.
http://code.activestate.com/lists/python-list/589529/
I stuck in the pdf.write("%%EOF") line to try to mimic this, but no dice.
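As a rough illustration of the "is the marker there at all" idea, here is a minimal sketch of a pre-check that only looks for the %%EOF bytes near the end of a file, without involving pyPdf. The function name has_eof_marker and the 1024-byte window are assumptions made up for this example, not anything taken from pyPdf or the linked post.

```python
import os


def has_eof_marker(path, tail_size=1024):
    """Return True if the bytes %%EOF occur in the last tail_size bytes.

    tail_size is an arbitrary window chosen for this sketch.
    """
    with open(path, 'rb') as f:
        # Find the file size, then jump to the start of the tail window.
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_size))
        return b'%%EOF' in f.read()
```

A file that fails this check could be set aside before it is ever handed to the PDF reader. A file that passes it may still be unreadable for other reasons, so this is only a filter, not a guarantee.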
So how do I get that error to run the except block? I'm also using Wing IDE, so if there's a way to use the debugger to just skip over these files, that would work too. Thanks.
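One thing worth noting about the code above: the pyPdf.PdfFileReader(...) call sits outside the try block, so an error raised while the file is being opened and parsed never reaches the except. The following is a minimal sketch of the loop with the constructor moved inside the try. The name classify_pdfs, the open_reader parameter, and the separate "unreadable" list are assumptions introduced for illustration. Whether an unreadable file should count as "has text" or be skipped is a choice left to the caller.

```python
import re


def classify_pdfs(pdf_paths, open_reader):
    """Split pdf_paths into (has_text, no_text, unreadable) lists.

    open_reader is a hypothetical callable that takes a path and returns an
    object exposing getNumPages() and getPage(i).extractText().
    """
    has_text, no_text, unreadable = [], [], []
    for path in pdf_paths:
        try:
            # Opening the reader happens inside the try, so an error raised
            # during construction (for example a missing EOF marker) is
            # caught below instead of stopping the whole run.
            reader = open_reader(path)
            found = False
            for i in range(reader.getNumPages()):
                if re.search(r'\w{2,}', reader.getPage(i).extractText()):
                    found = True
                    break
        except Exception:
            # Skip this file and move on to the next one.
            unreadable.append(path)
            continue
        if found:
            has_text.append(path)
        else:
            no_text.append(path)
    return has_text, no_text, unreadable
```

With pyPdf installed, the reader could be supplied as lambda p: pyPdf.PdfFileReader(open(p, 'rb')). Passing it in as an argument keeps the sketch independent of the library and makes the skip-on-error behaviour easy to try out with a stand-in reader.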