fallremix

Reputation: 13

Python UnicodeEncodeError when writing to file

I'm using "pdfminer.six", a Python library, to extract all text from a few PDFs. My method works for most files, but for some PDFs that probably contain a few special characters, writing the text to a file fails with "UnicodeEncodeError: 'charmap' codec can't encode character '\u03b2' in position 271130: character maps to <undefined>". I know what is happening, but I'd like to know the best way to handle it. This is the part that is giving me a headache:

    with open("newTxtFile.txt", "w") as textFile:
        textFile.write(text)

Since I'm from Brazil and the text is in Portuguese, I want to keep all accented characters, so I use "codec = 'latin-1'" with pdfminer. Printing the text before saving works all the way to the end, as far as I could check, but whenever I try to save it to a file I get the UnicodeEncodeError.

I can think of two options. Either I find a way to catch only the specific character that is giving me trouble:

    with open("newTxtFile.txt", "w") as textFile:
        try:
            textFile.write(text)
        except UnicodeEncodeError:
            ????

But I don't know what should go in the except block.

Or I should write to the file in a different way.

Can anyone give me a few tips? Many thanks in advance!

Upvotes: 1

Views: 1628

Answers (1)

sp________

Reputation: 2645

Try:

    with open("newTxtFile.txt", "wb") as textFile:
        textFile.write(text.encode('utf8'))

To read it back:

    with open("newTxtFile.txt", "rb") as textFile:
        text = textFile.read().decode('utf8')
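The same idea can probably be written in text mode by passing an encoding to open(), which avoids the manual encode/decode calls. The sketch below assumes the 'charmap' error comes from a platform default encoding such as cp1252 (a guess based on the error message, not something stated in the question). The helper names write_text_utf8, read_text_utf8 and write_text_lossy are made up for illustration. The last helper shows one way to handle the "catch the bad character" option from the question: errors="replace" swaps every unencodable character for "?" instead of raising.

```python
import os
import tempfile


def write_text_utf8(path, text):
    """Write text in text mode, encoding it as UTF-8 rather than the platform default."""
    with open(path, "w", encoding="utf-8") as text_file:
        text_file.write(text)


def read_text_utf8(path):
    """Read the file back, decoding it as UTF-8."""
    with open(path, "r", encoding="utf-8") as text_file:
        return text_file.read()


def write_text_lossy(path, text, encoding="cp1252"):
    """Write with a legacy encoding (cp1252 is an assumed default here).

    errors="replace" substitutes "?" for any character the codec cannot
    encode, so the write no longer raises UnicodeEncodeError but loses data.
    """
    with open(path, "w", encoding=encoding, errors="replace") as text_file:
        text_file.write(text)


# Small demonstration: accented Portuguese text plus a Greek beta (\u03b2).
sample = "ação \u03b2"
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "newTxtFile.txt")
    write_text_utf8(path, sample)
    assert read_text_utf8(path) == sample  # nothing is lost with UTF-8
```

UTF-8 keeps every character, including the Portuguese accents, so it is usually the safer choice. The lossy variant only makes sense if whatever reads the file really requires a legacy encoding.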

Upvotes: 3
