How to read a utf-8 encoded text file using Python

Question

I need to analyse a text file in Tamil (UTF-8 encoded). I'm using the nltk package for Python in IDLE. When I try to read the text file, I get the error below. How do I avoid this?

corpus = open('C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt').read()

Traceback (most recent call last):
  File "", line 1, in 
    corpus = open('C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt').read()
  File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>

Antonis Christofides · Accepted Answer

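Without an explicit encoding, open() falls back to the platform default, which is cp1252 on your machine (see the cp1252.py frame in the traceback), and cp1252 has no character for byte 0x8d. Since you are using Python 3, just add the encoding parameter to open():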

corpus = open(
    r"C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt", encoding="utf-8"
).read()
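
If you want to check this behaviour without touching your real file, here is a minimal sketch. The helper name read_utf8, the sample word, and the temporary file are my own stand-ins for illustration, not part of your setup. It writes a short Tamil string as UTF-8, reads it back with an explicit encoding, and then reproduces the failure by forcing cp1252:

```python
import os
import tempfile


def read_utf8(path):
    """Read a whole text file, decoding it explicitly as UTF-8.

    Hypothetical helper for illustration; the `with` block closes the file.
    """
    with open(path, encoding="utf-8") as f:
        return f.read()


# Made-up sample text: the word "Tamil" in Tamil script.
# Its last character (U+0BCD) is encoded in UTF-8 as E0 AF 8D,
# so the file contains the byte 0x8d seen in the traceback.
sample = "தமிழ்"

# A temporary file stands in for the real corpus file.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Explicit UTF-8 decoding returns the original text.
    corpus = read_utf8(path)

    # Forcing cp1252 mimics the Windows default and fails on byte 0x8d.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True
finally:
    os.remove(path)
```

Here corpus should equal sample, and cp1252_failed should be True. If your file happens to start with a byte order mark, encoding="utf-8-sig" will strip it while reading.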
