Nathan Jones

Reputation: 5174

How do I detect when a PDF's text was successfully extracted with PyPDF2's extractText?

I'm using the PyPDF2 library to extract text from PDF files via its extractText function, and for most PDFs, it works great!

However, some PDFs produce text that looks like this:

\n!"#$%&'()"+,"-.".)/"0$-1"2)+3-$.45\n""!"#$%&'()#'+),$!"#-.#$-/$0.1+"#+12$\n!"#"$!%"&#"%$'$()%+,-$(%.($#"$(%"&#%/%0!%\n$0"&$(%1(0,$2%3(%0"%0!%"&$%1(34+5"%36%1(0,$!7\n%%8%!"#$%&'($)%"\n%0!%#%+,-$(%"&#"%0!%3*9)%40'0!0-9$%-)%/%#*4%0"!$967\n%%:%0!%"&$%3*9)%$'$%\n1(0,$%+,-$(7\n%%;3%099+!"(#"$%6+4#,$"#9%"&$3($,%36%#(0"&,$"052%<%90!"%-$93=%"&$%1(0,$%6#5"3(0>#"03*%\n36%+,-$(!%-$"=$$%/%#4%:?7%@(0,$%+,-$(!%#($%0*%\n6.'78"AB%,$#*!%,+9"019)7C\n%"/D%E$0"&$(%1(0,$%*3(%53,13!0"$7%\n%:D%9%%%%%%%/FD%:BG\n%HD%:%%%%%%%/?D%HB?\n%%FD%:B:\n%3(

According to the docs, this is to be expected:

This works well for some PDF files, but poorly for others, depending on the generator used.

Unfortunately, the extractText() function doesn't raise any exceptions when it outputs text like the above.

So, my question is, is there a way to programmatically detect when the extractText() function returns gibberish?

Upvotes: 1

Views: 438

Answers (1)

Nathan Jones

Reputation: 5174

Based on @DYZ's comment, here's the solution.

document_path is assumed to be the path to the PDF file you're opening. A page is treated as readable if at least one of its whitespace-separated tokens is a known English word.

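from PyPDF2 import PdfFileReader
from nltk.corpus import words

# Requires the NLTK word list: run nltk.download('words') once beforehand.
# Build a set once so membership checks are fast and the import isn't shadowed.
english_words = set(w.lower() for w in words.words())

with open(document_path, 'rb') as pdf_file:
    document_file = PdfFileReader(pdf_file)
    num_pages = document_file.getNumPages()
    for page_num in range(num_pages):
        page = document_file.getPage(page_num)
        page_contents = page.extractText()
        # Keep the page only if it shares at least one token with the word list
        if set(page_contents.lower().split()) & english_words:
            pass  # process page_contents here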
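
The check above accepts a page as soon as a single token matches, so a mostly garbled page with one lucky hit would still pass. A stricter variant could look at the fraction of tokens that are real words instead. The following is only a sketch under assumptions: the helper names word_hit_ratio and looks_like_text, the 0.5 threshold, and the tiny stand-in vocabulary are all hypothetical and not part of PyPDF2 or NLTK. In practice the vocabulary would be the set built from the NLTK word list.

```python
import re


def word_hit_ratio(text, vocabulary):
    """Return the fraction of alphabetic tokens in text that appear in vocabulary."""
    # Pull out runs of letters so punctuation glued to a word doesn't hide it.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


def looks_like_text(text, vocabulary, threshold=0.5):
    """Guess whether text is readable; threshold is an assumed cut-off to tune."""
    return word_hit_ratio(text, vocabulary) >= threshold


# Tiny stand-in vocabulary for illustration only (hypothetical data).
sample_vocabulary = {
    "prime", "numbers", "are", "the", "building", "blocks", "of", "arithmetic",
}

good = "Prime numbers are the building blocks of arithmetic."
# Made-up fragment in the style of the garbled output shown in the question.
bad = "6.'78AB%,$#*!%,+9019)7C"

print(looks_like_text(good, sample_vocabulary))
print(looks_like_text(bad, sample_vocabulary))
```

The threshold would need tuning against real documents, since pages full of names, numbers, or non-English text will score low even when extraction worked.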

Upvotes: 1
