I am trying to extract the text of every PDF in a directory and write it to a .txt file. This mostly works, but it fails whenever a PDF contains non-Latin letters. How can I modify the code below to avoid the error shown at the bottom? I have looked at similar questions and tried many of the suggested solutions, but none worked. Thank you.
import glob
import PyPDF2

pdfs = glob.glob("/private/Documents/*.pdf")
for pdf in pdfs:
    with open(pdf, 'rb') as pdfFileObj:
        # creating a pdf reader object
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
        print(pdfReader.numPages)
        pageObj = pdfReader.getPage(0)
        gg = pageObj.extractText()
        print(gg)
        utxt = str(gg)
        print(utxt)
        stxt = utxt.encode('latin-1', 'ignore')
        print(stxt)
        with open('quotes.txt', 'w', encoding='utf-8') as f:
            f.write(utxt)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0445' in position 0: ordinal not in range(256)
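For reference, here is a minimal sketch of the encoding side of the problem only, separate from the PDF extraction. It assumes (the full traceback is not shown, so this is a guess) that the error comes from a strict latin-1 encode somewhere, for example `print()` on a console whose encoding is latin-1. The helper names `to_console_safe` and `append_text` are hypothetical, not part of PyPDF2 or the standard library.

```python
import sys


def to_console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by '?', so printing it does not raise UnicodeEncodeError.

    Hypothetical helper: 'encoding' defaults to the console's encoding.
    """
    enc = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(enc, errors="replace").decode(enc)


def append_text(path, text):
    """Append text to a UTF-8 file. Mode 'a' is used so that text from
    several PDFs accumulates instead of each write replacing the last."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text + "\n")


# Sample standing in for text extracted from a PDF: Cyrillic plus ASCII.
sample = "\u0445\u043b\u0435\u0431 bread"

# Encoding to latin-1 with 'ignore' silently drops the Cyrillic letters,
# so this is lossy rather than a fix.
lossy = sample.encode("latin-1", "ignore")

# A strict latin-1 encode raises the error quoted in the question.
try:
    sample.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

In the loop from the question, this would mean calling something like `print(to_console_safe(gg))` instead of `print(gg)`, and `append_text('quotes.txt', gg)` instead of reopening the file in `'w'` mode, which replaces its contents on every write. The UTF-8 file keeps the non-Latin letters intact; only the console output is degraded.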