paulowe

Reputation: 143

What is causing a UnicodeDecodeError when trying to read a text file?

I am trying to run this code snippet in Python 3.8:

def load_rightprob(self, rightprob_file):
    ''' dictionary with # people keys with # actions  '''
    rightProb = {}
    for line in open(rightprob_file):
        items = line.strip().split("\t")
        if len(items) != len(self.action_qid_dict) + 1:
            continue
        pid = int(items[0])

but I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I tried for line in open(rightprob_file, 'rb'): instead, but then the following line fails with this error:

TypeError: a bytes-like object is required, not 'str' 

Can somebody please suggest how to fix this? I am reading from a .txt file where each line is an ID followed by 377 columns of probability values associated with that ID.

Thanks.

Upvotes: 0

Views: 1053

Answers (1)

Mark Ransom

Reputation: 308530

It's very unusual for a text file to start with 0xff. Because of that, it's sometimes placed deliberately at the start of the file as part of a Byte Order Mark (BOM) for Unicode, particularly on Windows. As you can see in the table in the link, only two Unicode encodings have a BOM that starts with 0xff: UTF-16 (bytes FF FE) and UTF-32 (bytes FF FE 00 00), both little-endian. Of the two, UTF-16 is far more commonly encountered.
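If you want to confirm which BOM the file actually has before picking an encoding, you can look at its first few bytes. This is only a rough sketch: sniff_bom is a hypothetical helper name, not something from your code, and it only recognises files that really start with a BOM. It relies on the BOM constants in the standard codecs module.

```python
import codecs

# Hypothetical helper (not part of the question's code): guess a codec
# name from the BOM at the start of the file, or return None if there
# is no recognisable BOM.
def sniff_bom(path):
    with open(path, 'rb') as f:
        head = f.read(4)
    # Check UTF-32 first: its little-endian BOM (FF FE 00 00) begins
    # with the UTF-16 little-endian BOM (FF FE).
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return 'utf_32'
    if head.startswith(codecs.BOM_UTF8):
        return 'utf_8_sig'
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf_16'
    return None
```

The names it returns are the BOM-aware codecs, so passing the result to open(..., encoding=...) should also consume the BOM. One caveat: a UTF-16 file whose first real character is U+0000 would be misread as UTF-32, which is unlikely for a tab-separated text file.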

So open your file like this:

with open(rightprob_file, 'r', encoding='utf_16_le') as f:
    for line in f:
        items = line.strip().split("\t")
        # ... rest of your loop body, unchanged

I added the with so that the file is closed automatically when you're done; leaving it open was a bug in your original code.

The first character read from the file will be u'\ufeff' and can be thrown away or otherwise ignored.
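Note that str.strip() does not treat '\ufeff' as whitespace, so as far as I can tell int(items[0]) would fail on the first line if the BOM is still attached. Here is a minimal sketch of two ways to get rid of it. The sample file and the names items_manual and items_auto are made up for illustration; your real file has many more columns.

```python
import os
import tempfile

# Build a tiny UTF-16 LE sample with a BOM, standing in for the real file.
sample = '\ufeff' + '7\t0.25\t0.75\n'
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(sample.encode('utf_16_le'))

# Option 1: keep utf_16_le and drop the BOM from the first line by hand.
with open(path, 'r', encoding='utf_16_le') as f:
    first = f.readline()
items_manual = first.lstrip('\ufeff').strip().split('\t')

# Option 2: use the generic utf_16 codec, which reads the BOM to pick
# the byte order and does not return it as part of the text.
with open(path, 'r', encoding='utf_16') as f:
    items_auto = f.readline().strip().split('\t')

os.remove(path)

# With the BOM gone, the ID column converts cleanly.
pid = int(items_manual[0])
```

Option 2 is less typing, but it only works when the file really does start with a BOM. Option 1 also copes with BOM-less little-endian files.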

Upvotes: 1
