MSepehr

Reputation: 970

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 34: invalid continuation byte

I want to open a Persian-language text file in Python with the code below:

    import codecs
    lines = []
    for line in codecs.open('0001.txt', encoding='UTF-8'):
        lines.append(line)

but it gives me this error:

> Traceback (most recent call last):
  File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 1596, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 974, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/usr/lib/pycharm-community/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/nlpuser/Documents/ms/Work/General_Dataset_creator/BijanKhanReader.py", line 24, in <module>
    for lin in codecs.open('corpuses/markaz/0001.txt',encoding='UTF-8'):
  File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 713, in __next__
    return next(self.reader)
  File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 644, in __next__
    line = self.readline()
  File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 557, in readline
    data = self.read(readsize, firstline=True)
  File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 503, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 0: invalid continuation byte

What is wrong with this code?

This is the output of the file command for that file:

0001.txt: Non-ISO extended-ASCII text, with CRLF line terminators

Upvotes: 2

Views: 8544

Answers (1)

Amadan

Reputation: 198334

UTF-8 has a very specific format, given that a character can be represented by anywhere from one to four bytes.

A single-byte character is encoded as one byte in the range 0x00 to 0x7F. A multi-byte character starts with a lead byte in the range 0xC2 to 0xF4, followed by one to three continuation bytes, each in the range 0x80 to 0xBF.
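
To make the byte ranges concrete, here is a minimal sketch that labels a byte by the role it can play in UTF-8. The helper name classify_utf8_byte is made up for illustration; it only checks ranges and does not validate whole sequences.

```python
def classify_utf8_byte(b: int) -> str:
    """Label a byte value (0-255) by the role it can play in UTF-8."""
    if b <= 0x7F:
        return "ascii"         # a complete single-byte character
    if b <= 0xBF:
        return "continuation"  # 0x80-0xBF, only legal after a lead byte
    if 0xC2 <= b <= 0xDF:
        return "lead-2"        # opens a two-byte sequence
    if 0xE0 <= b <= 0xEF:
        return "lead-3"        # opens a three-byte sequence
    if 0xF0 <= b <= 0xF4:
        return "lead-4"        # opens a four-byte sequence
    return "invalid"           # 0xC0, 0xC1 and 0xF5-0xFF never occur in UTF-8


# U+0645 (the Arabic/Persian letter meem) takes two bytes in UTF-8.
encoded = "\u0645".encode("utf-8")
roles = [classify_utf8_byte(b) for b in encoded]
print(" ".join(f"{b:02x}" for b in encoded), roles)
```

By these ranges, 0xE3 is a lead byte for a three-byte sequence, so it can never be a continuation byte.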

In your case, Python found the byte 0xE3, which opens a three-byte sequence, but the byte that follows it is not a legal continuation byte (the error message names the byte that starts the broken sequence). The problem is likely in your text file, not in your program: either the file is badly encoded, or it is in an encoding other than UTF-8.
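
This kind of error can be reproduced without the file. The sketch below uses an invented byte string, not the contents of 0001.txt: 0xE3 is followed by a space (0x20), which lies outside the continuation range 0x80 to 0xBF.

```python
# Invented sample: 0xE3 followed by a byte that is not a continuation byte.
broken = b"\xe3\x20abc"

error = None
try:
    broken.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    # exc.start is the offset of the byte Python reports in the message.
    print(exc.reason, "at offset", exc.start, "byte", hex(broken[exc.start]))
```

In this sketch the reported byte is 0xE3 itself, at offset 0.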

Use hexdump -C <file> or xxd <file> to see the exact sequence of bytes you have, and file <file> to try to guess the encoding; with that, we might be able to say more.
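
If those tools are not at hand, a rough equivalent can be sketched in Python: dump the first bytes in hex, then see which candidate encodings decode them without error. Everything specific here is an assumption: the helper name try_encodings is made up, the sample bytes are invented, and cp1256 (Windows Arabic) is only a guess at a legacy encoding that Persian text is sometimes stored in. For a real file, the bytes would come from open(path, "rb").read(64).

```python
def try_encodings(raw, candidates):
    """Return {encoding: decoded text, or None if decoding fails}."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Invented sample: the word "salam" in Persian script, stored as cp1256.
word = "\u0633\u0644\u0627\u0645"
raw = word.encode("cp1256")

# Hex view of the raw bytes, similar in spirit to xxd.
hex_view = " ".join(f"{b:02x}" for b in raw)
print(hex_view)

results = try_encodings(raw, ["utf-8", "cp1256"])
for name, text in results.items():
    print(name, "->", "failed" if text is None else "decoded")
```

A successful decode does not prove the guess is right, since single-byte encodings accept almost any input; the decoded text still has to be checked by eye.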

Upvotes: 3
