Reputation: 13
I'm using "pdfminer.six", a python library, to extract all text from a few PDFs I have. My method works flawlessly, but with some pdfs, that probably have a few characters in special, when I'm writing it to a text file, I get "Unicode Encode Error: 'charmap' codec can't encode character '\u03b2' in position 271130: character maps to ". Now, I know what "is" happening, but I'd like to know how to treat it the best way. This is the part that is giving me a headache:
with open("newTxtFile.txt", "w") as textFile:
textFile.write(text)
Since I'm from Brazil and the text is in Portuguese, I want to keep all the accented characters, so I pass codec='latin-1' to pdfminer. Printing the text before saving works flawlessly as far as I could check, but whenever I try to save it to a file I get the UnicodeEncodeError.
The two options I thought about are these. Either I find a way to catch only the specific characters that are giving me trouble:
with open("newTxtFile.txt", "w") as textFile:
try:
textFile.write(text)
except UnicodeEncodeError:
????
But I don't know what should go in the except block (a rough attempt is sketched below). Or should I save to the file in a different way?
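For what it's worth, the closest I got on the first option is the sketch below. It assumes that losing the problem characters is acceptable, because errors="replace" (from the standard open() API) just writes '?' in their place, which is not really what I want:

try:
    with open("newTxtFile.txt", "w") as textFile:
        textFile.write(text)
except UnicodeEncodeError:
    # Fallback: write the file again, letting the codec replace anything
    # it cannot encode (the Greek beta, etc.) with '?'.
    with open("newTxtFile.txt", "w", errors="replace") as textFile:
        textFile.write(text)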
Can anyone give me a few tips? Many thanks in advance!
Upvotes: 1
Views: 1628
Reputation: 2645
Try this:
with open("newTxtFile.txt", "wb") as textFile:
textFile.write(text.encode('utf8'))
To read it back:
with open("newTxtFile.txt", "rb") as textFile:
text = textFile.read().decode('utf8')
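Equivalently, assuming Python 3, you can stay in text mode and give open() an explicit encoding, so there is no manual encode/decode step. The original error comes from open() falling back to the platform's default codec (the 'charmap' message suggests a Windows code page), which cannot represent characters like '\u03b2':

# Text mode with an explicit encoding instead of manual encode()/decode().
with open("newTxtFile.txt", "w", encoding="utf-8") as textFile:
    textFile.write(text)

# And to read it back:
with open("newTxtFile.txt", "r", encoding="utf-8") as textFile:
    text = textFile.read()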
Upvotes: 3