Reputation: 1810
I'm trying to read a (large) text file using Python 3.7. I'm trivially doing:
with open(filename, 'r') as f:
    for il, l in enumerate(f, il):
        pass  # do things
This works perfectly if I run the script from Spyder's IPython console on Windows.
However, if I run the exact same script to read the exact same file (not a copy!) from a Unix server, I get the following error:
  File "/net/atgcls01/data2/j02660606/code/freeGSA.py", line 127, in read_gwa
    for il,l in enumerate(f,il):
  File "/u/lon/lamerio/.conda/envs/la3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 2099: invalid start byte
I tried to find the culprit to understand what is going on. I did:
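# Collect (offset, raw byte, hex) for the first 3000 bytes of the file.
# Named byte_info to avoid shadowing the built-in `bytes`.
byte_info = []
with open(settings['GSA_file'], 'rb') as fobj:
    for i in range(3000):
        b = fobj.read(1)
        byte_info.append((i, b, b.hex()))
byte_info[2095:2105]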
The output is:
[(2095, b'0', '30'), (2096, b'0', '30'), (2097, b' ', '20'), (2098, b't', '74'), (2099, b'o', '6f'), (2100, b' ', '20'), (2101, b'5', '35'), (2102, b'6', '36'), (2103, b'1', '31'), (2104, b' ', '20')]
I don't see any 0xb0 byte at position 2099. Indeed, position 2098 is 0x74, position 2099 is 0x6f and position 2100 is 0x20. These translate to the valid UTF-8 characters 't', 'o', ' ' (space), which are indeed at those positions in the file.
How can I solve this error, and why does it arise only on the Unix machine?
EDIT: Running
import sys
sys.getdefaultencoding()
returns 'utf-8'
on both systems.
PS: On Windows I have version 3.7.5, while on Unix I have 3.7.4.
Upvotes: 1
Views: 309
Reputation: 815
On the Unix machine, try
with open(filename, encoding='latin-1') as f: ...
or
with open(filename, encoding='windows-1252') as f: ...
Edit: Windows usually has a different default encoding than Unix. I assume you created or edited the files on your Windows machine. You can also open one of those files in Notepad, which I believe shows the encoding in the bottom-right corner, though I'm recalling this from memory and might be wrong. In any case, that is the encoding you want to specify on your Unix machine. But go ahead and try the two encodings I suggested above.
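To check which default each machine actually uses and where the offending byte sits, something like the following may help. It is only a sketch: `find_undecodable_bytes` is a hypothetical helper name, not part of any library. Note that `open()` in text mode takes its default from `locale.getpreferredencoding(False)`, not from `sys.getdefaultencoding()`, so the latter returning 'utf-8' on both systems does not rule out a difference. Also, as far as I know, the position in the error message is relative to the chunk of bytes being decoded rather than to the start of the file, which would explain why byte 2099 of the file looks fine.

```python
import locale

# The encoding open() falls back to when no encoding= argument is given.
print(locale.getpreferredencoding(False))


def find_undecodable_bytes(path, encoding="utf-8", limit=5):
    """Hypothetical helper: return up to `limit` tuples of
    (line number, absolute byte offset, hex of the offending byte)
    for the first undecodable byte on each failing line."""
    hits = []
    offset = 0  # absolute offset of the start of the current line
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, 1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                bad = raw[err.start:err.start + 1]
                hits.append((lineno, offset + err.start, bad.hex()))
                if len(hits) >= limit:
                    break
            offset += len(raw)
    return hits
```

Calling `find_undecodable_bytes(filename)` on the Unix machine should list the real file offsets of bytes such as 0xb0, which you can then inspect in binary mode as in the question.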
Upvotes: 1
Reputation: 1804
The problem may be the default encoding. On Windows it may not be UTF-8 but some Windows-specific encoding. In Poland the default encoding is cp1250,
and code like this will work:
with open(filename, 'r', encoding="cp1250") as f:
    for il, l in enumerate(f, il):
        pass  # do things
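As a quick illustration of why the explicit encoding matters, here is a small sketch that decodes the byte from the error message, 0xb0, with several candidate encodings. The sample bytes are made up for the example; which encoding is right for the real file still has to be checked against its contents.

```python
# Made-up sample containing the byte from the error message (0xb0).
raw = b"25\xb0C"

candidates = ["utf-8", "latin-1", "windows-1252", "cp1250"]
results = {}
for enc in candidates:
    try:
        results[enc] = raw.decode(enc)
    except UnicodeDecodeError as err:
        # 0xb0 alone is a UTF-8 continuation byte, so it cannot start a character.
        results[enc] = "failed: " + err.reason

for enc, text in results.items():
    print(enc, "->", text)
```

In all three single-byte encodings, 0xb0 maps to the degree sign, so a file with temperatures or angles saved on Windows would plausibly contain it.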
Upvotes: 1
Reputation: 27
It's a Unicode character; you can use the "unidecode" module to decode it. It will work great.
You can read more about it here: https://pypi.org/project/Unidecode/
Upvotes: 0