Python encoding errors latin-1 PyPDF2

Question

I am trying to extract the text of every PDF in a directory and write it all to a single .txt file. This mostly works, but I frequently hit PDFs that contain non-Latin letters, and then the error shown at the bottom occurs. Could someone tell me how to modify the code below to avoid it? I have looked at similar questions and tried many of the suggested solutions, but none worked. Thank you.

import glob
import PyPDF2
pdfs = glob.glob("/private/Documents/*.pdf")

for pdf in pdfs:
    with open(pdf, 'rb') as pdfFileObj:
        
        # creating a pdf reader object
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
        print(pdfReader.numPages)
        pageObj = pdfReader.getPage(0)
        gg = pageObj.extractText()
        print(gg)
        utxt = str(gg)
        print(utxt)
        stxt = utxt.encode('latin-1', 'ignore')
        print(stxt)

with open('quotes.txt', 'w', encoding='utf-8') as f:
    f.write(utxt)

UnicodeEncodeError: 'latin-1' codec can't encode character '\u0445' in position 0: ordinal not in range(256)
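
To narrow this down, here is a minimal sketch that reproduces the encoding behaviour without PyPDF2. The character '\u0445' in the traceback is a Cyrillic letter, which latin-1 cannot represent. Note that encode('latin-1', 'ignore') does not raise, it only drops such characters, so the traceback may instead come from one of the print calls if the terminal's output encoding is latin-1 (that is an assumption about the environment, not something the post confirms). The sample string and the write_pages helper are hypothetical names used only for illustration.

```python
import tempfile
from pathlib import Path

# Stand-in for text returned by extractText(): Cyrillic plus ASCII.
sample = "\u0445\u043b\u0435\u0431 bread"

# Strict latin-1 cannot represent U+0445, matching the error message.
try:
    sample.encode("latin-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# errors="ignore" does not raise, but silently drops the Cyrillic letters.
lossy = sample.encode("latin-1", "ignore")


def write_pages(page_texts, out_path):
    """Write each extracted string to one UTF-8 file (hypothetical helper)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for text in page_texts:
            f.write(text)
            f.write("\n")


# UTF-8 can represent every Unicode character, so nothing is lost here.
out_file = Path(tempfile.mkdtemp()) / "quotes.txt"
write_pages([sample, "second page"], out_file)
round_trip = out_file.read_text(encoding="utf-8")
```

If this sketch matches the situation, collecting the text inside the loop and writing it through a UTF-8 file handle, without the latin-1 step or the print calls, would keep the non-Latin letters and cover every PDF rather than only the last one.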

Answers (0)
