Luca

Reputation: 1810

UnicodeDecodeError when reading file only on unix system

I'm trying to read a (large) text file using Python 3.7. I'm simply doing:

with open(filename,'r') as f:
    for il,l in enumerate(f,il):
        pass  # do things

This works perfectly if I run the script from Spyder's IPython console on Windows.

However, if I run the exact same script to read the exact same file (not a copy!) from a Unix server, I get the following error:

  File "/net/atgcls01/data2/j02660606/code/freeGSA.py", line 127, in read_gwa
    for il,l in enumerate(f,il):
  File "/u/lon/lamerio/.conda/envs/la3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 2099: invalid start byte

I tried to find the culprit to understand what is going on. I did:

byte_list = []  # named byte_list to avoid shadowing the built-in `bytes`
with open(settings['GSA_file'], 'rb') as fobj:
    for i in range(3000):
        b = fobj.read(1)
        byte_list.append((i, b, b.hex()))

byte_list[2095:2105]

The output is:

[(2095, b'0', '30'), (2096, b'0', '30'), (2097, b' ', '20'), (2098, b't', '74'), (2099, b'o', '6f'), (2100, b' ', '20'), (2101, b'5', '35'), (2102, b'6', '36'), (2103, b'1', '31'), (2104, b' ', '20')]

I don't see any 0xb0 byte at position 2099. Indeed, position 2098 is 0x74, position 2099 is 0x6f, and position 2100 is 0x20. These translate to the valid UTF-8 characters 't', 'o', ' ' (space), which are indeed at those positions in the file.

How can I solve this error, and why does it arise only on the Unix machine?

EDIT: Running

import sys
sys.getdefaultencoding()

returns 'utf-8' on both systems.

PS: On Windows I have version 3.7.5, while on Unix I have 3.7.4.

Upvotes: 1

Views: 309

Answers (3)

TomMP

Reputation: 815

On the Unix machine, try

with open(filename, encoding='latin-1') as f: ...

or

with open(filename, encoding='windows-1252') as f: ...

Edit: Windows usually has a different default encoding than Unix. I assume you created or edited the files on your Windows machine. You can also open one of those files in Notepad, which I believe shows the encoding in the bottom right corner; I might be wrong about this, as I'm recalling it from memory. In any case, that's the encoding you want to specify on your Unix machine. But go ahead and try the two encodings I have specified.
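To see which bytes are actually causing trouble, a small scan of the raw file can help. As far as I understand it, the position reported in the traceback is relative to the chunk of bytes the decoder was handed at that moment, not to the start of the file, which would explain why byte 2099 of the file looks harmless. The sketch below is only an illustration: the function name `find_non_utf8_bytes` and the `limit` parameter are made up here, and it reads the whole file into memory, which may need adapting for a very large file.

```python
def find_non_utf8_bytes(path, limit=10):
    """Return up to `limit` (absolute_offset, byte) pairs where UTF-8 decoding fails."""
    with open(path, "rb") as fobj:
        data = fobj.read()  # assumption: the file fits in memory

    bad = []
    pos = 0
    while pos < len(data) and len(bad) < limit:
        try:
            data[pos:].decode("utf-8")
            break  # everything from `pos` onward is valid UTF-8
        except UnicodeDecodeError as exc:
            # exc.start is relative to the slice, so add `pos` to get the file offset
            offset = pos + exc.start
            bad.append((offset, data[offset:offset + 1]))
            pos = offset + 1  # resume scanning just after the offending byte
    return bad
```

Calling `find_non_utf8_bytes(filename)` on the Unix machine should list the real offsets. For what it's worth, the byte 0xb0 is the degree sign in both latin-1 and windows-1252, so if the file contains temperatures or angles, either of the two encodings above is a plausible candidate.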

Upvotes: 1

Grzegorz Bokota

Reputation: 1804

The problem may be the default encoding. On Windows it may not be UTF-8 but some Windows code page. In Poland the default encoding is cp1250, so code like this will work:

with open(filename, 'r', encoding="cp1250") as f:
    for il,l in enumerate(f,il):
        pass  # do things
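To check which encoding each machine actually uses, something like the following may help. It is a minimal sketch, assuming the script does not pass `encoding` to `open()`: in that case Python 3.7 falls back to `locale.getpreferredencoding(False)`, not to `sys.getdefaultencoding()`, so the latter can report 'utf-8' on both systems while the files are still read differently.

```python
import locale
import sys

# Encoding used internally for str/bytes conversions; 'utf-8' on Python 3
# regardless of platform, so it does not explain the difference.
interpreter_default = sys.getdefaultencoding()

# Encoding that open() falls back to in text mode when `encoding` is omitted.
open_default = locale.getpreferredencoding(False)

print("sys.getdefaultencoding():", interpreter_default)
print("locale.getpreferredencoding(False):", open_default)
```

If the second value differs between the two machines (for example a Windows code page on one and UTF-8 on the other), passing `encoding=` explicitly, as above, makes the script behave the same everywhere.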

Upvotes: 1

Ali k.

Reputation: 27

It's a Unicode character; you can use the "unidecode" module to decode it. It will work great.

You can read more about it here: https://pypi.org/project/Unidecode/

Upvotes: 0
