Reputation: 5174
I'm using the PyPDF2 library to extract text from PDF files via its extractText
function, and for most PDFs, it works great!
However, some PDFs produce text that looks like this:
\n!"#$%&'()"+,"-.".)/"0$-1"2)+3-$.45\n""!"#$%&'()#'+),$!"#-.#$-/$0.1+"#+12$\n!"#"$!%"&#"%$'$()%+,-$(%.($#"$(%"&#%/%0!%\n$0"&$(%1(0,$2%3(%0"%0!%"&$%1(34+5"%36%1(0,$!7\n%%8%!"#$%&'($)%"\n%0!%#%+,-$(%"&#"%0!%3*9)%40'0!0-9$%-)%/%#*4%0"!$967\n%%:%0!%"&$%3*9)%$'$%\n1(0,$%+,-$(7\n%%;3%099+!"(#"$%6+4#,$"#9%"&$3($,%36%#(0"&,$"052%<%90!"%-$93=%"&$%1(0,$%6#5"3(0>#"03*%\n36%+,-$(!%-$"=$$%/%#4%:?7%@(0,$%+,-$(!%#($%0*%\n6.'78"AB%,$#*!%,+9"019)7C\n%"/D%E$0"&$(%1(0,$%*3(%53,13!0"$7%\n%:D%9%%%%%%%/FD%:BG\n%HD%:%%%%%%%/?D%HB?\n%%FD%:B:\n%3(
According to the docs, this is to be expected:
This works well for some PDF files, but poorly for others, depending on the generator used.
Unfortunately, the extractText()
function doesn't raise any exceptions when it outputs text like the above.
So, my question is, is there a way to programmatically detect when the extractText()
function returns gibberish?
Upvotes: 1
Views: 438
Reputation: 5174
Based on @DYZ's comment, here's the solution.
document_path is assumed to be the path to the PDF file you're opening. The rest should be pretty self-explanatory.
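from PyPDF2 import PdfFileReader
from nltk.corpus import words

# Build the vocabulary once as a lowercase set for fast lookups
# (the corpus must be installed first: nltk.download('words')).
english_words = set(word.lower() for word in words.words())

with open(document_path, 'rb') as pdf_file:
    document_file = PdfFileReader(pdf_file)
    num_pages = document_file.getNumPages()
    for page_num in range(num_pages):
        page = document_file.getPage(page_num)
        page_contents = page.extractText()
        # Treat the page as readable if at least one token is an English word.
        if set(page_contents.lower().split()).intersection(english_words):
            pass  # process page_contents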
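One caveat with the check above: it passes as soon as a single token on the page is a dictionary word, so a mostly garbled page containing one stray "a" or "is" would still get through. A stricter variant is to require that some minimum fraction of the tokens are real words. Here's a rough sketch of that idea, not something taken from PyPDF2 or NLTK. The names word_ratio, looks_like_text and min_ratio are made up for illustration, the 0.5 threshold is an arbitrary starting point you'd want to tune, and the tiny sample_vocabulary just stands in for a lowercased set built from words.words().

```python
import re


def word_ratio(text, vocabulary):
    """Return the fraction of whitespace-separated tokens found in vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    # Strip leading/trailing punctuation so "prime," still matches "prime".
    cleaned = [re.sub(r"^\W+|\W+$", "", token) for token in tokens]
    hits = sum(1 for token in cleaned if token in vocabulary)
    return hits / len(tokens)


def looks_like_text(text, vocabulary, min_ratio=0.5):
    """Hypothetical helper: True if enough tokens are known words.

    min_ratio is an assumed threshold, not a recommended value.
    """
    return word_ratio(text, vocabulary) >= min_ratio


# Tiny stand-in vocabulary so the sketch runs without NLTK installed.
sample_vocabulary = {"a", "prime", "number", "is", "the", "only", "even"}

readable = "The only even prime number is 2."
gibberish = '!"#$%&\'()"+,"-.".)/"0$-1"2)+3-$.45'

print(looks_like_text(readable, sample_vocabulary))
print(looks_like_text(gibberish, sample_vocabulary))
```

In the loop above you would then call looks_like_text(page_contents, vocabulary) with your real word set in place of the set intersection test.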
Upvotes: 1