Reputation: 1981
I'm having some issues with a Python script that needs to open files with different encodings.
I usually use this:
with open(path_to_file, 'r') as f:
    first_line = f.readline()
And that works great when the file is properly encoded.
But sometimes it doesn't work. For example, with this file I get this:
In [22]: with codecs.open(filename, 'r') as f:
...: a = f.readline()
...: print(a)
...: print(repr(a))
...:
��Test for StackOverlow
'\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'
And I would like to search for some text in those lines. Sadly, with that method, I can't:
In [24]: "Test" in a
Out[24]: False
I've found a lot of questions here referring to the same type of issue, but I can't manage to decode the file properly with any of them...
With codecs.open():
In [17]: with codecs.open(filename, 'r', "utf-8") as f:
   ....:     a = f.readline()
   ....:     print(a)
   ....:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-17-0e72208eaac2> in <module>()
1 with codecs.open(filename, 'r', "utf-8") as f:
----> 2 a = f.readline()
3 print(a)
4
/usr/lib/python2.7/codecs.pyc in readline(self, size)
688 def readline(self, size=None):
689
--> 690 return self.reader.readline(size)
691
692 def readlines(self, sizehint=None):
/usr/lib/python2.7/codecs.pyc in readline(self, size, keepends)
543 # If size is given, we call read() only once
544 while True:
--> 545 data = self.read(readsize, firstline=True)
546 if data:
547 # If we're at a "\r" read one extra character (which might
/usr/lib/python2.7/codecs.pyc in read(self, size, chars, firstline)
490 data = self.bytebuffer + newdata
491 try:
--> 492 newchars, decodedbytes = self.decode(data, self.errors)
493 except UnicodeDecodeError, exc:
494 if firstline:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
With a.encode('utf-8'):
In [18]: with codecs.open(filename, 'r') as f:
   ....:     a = f.readline()
   ....:     print(a)
   ....:     a.encode('utf-8')
   ....:     print(a)
   ....:
��Test for StackOverlow
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-18-7facc05b9cb1> in <module>()
2 a = f.readline()
3 print(a)
----> 4 a.encode('utf-8')
5 print(a)
6
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
I've found a way to change file encoding automatically with Vim:
system("vim '+set fileencoding=utf-8' '+wq' %s" % path_to_file)
But I would like to do this without using Vim...
Any help will be appreciated.
Upvotes: 2
Views: 3085
Reputation: 114038
It looks like this is UTF-16-LE (UTF-16 little endian), but you are missing a final \x00: readline() on the plain byte stream splits at the raw \n byte, so the second byte of the UTF-16 newline gets left behind.
>>> s = '\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'
>>> s.decode('utf-16-le') # creates error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python26\lib\encodings\utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 46: truncated data
>>> (s+"\x00").decode("utf-16-le") # TADA!!!!
u'\ufeffTest for StackOverlow\r\n'
>>>
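In practice you can let the codec layer do the decoding while reading the file, instead of patching the bytes afterwards. A minimal sketch, assuming the file really starts with the \xff\xfe BOM shown above (so the generic utf-16 codec can work out the byte order itself) and reusing the filename variable from the question:
# Decode while reading: the utf-16 StreamReader splits lines on the
# decoded text, so the trailing \x00 of the newline is no longer lost.
import codecs

with codecs.open(filename, 'r', encoding='utf-16') as f:
    first_line = f.readline()

print(repr(first_line))      # u'Test for StackOverlow\r\n' (BOM already stripped)
print("Test" in first_line)  # True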
Upvotes: 6
Reputation: 37113
It looks like you need to detect the encoding of the input file. The chardet library mentioned in the answer to this question might help (though note the proviso that completely reliable encoding detection is not possible).
Then you can write the file out in a known encoding, perhaps. When dealing with Unicode remember that it MUST be encoded into a suitable bytestream before being communicated outside the process. Decode on input, then encode on output.
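A rough sketch of that decode-on-input, encode-on-output flow, assuming the chardet package is installed (its detect() function returns a dict with an 'encoding' key; the helper name and variables here are just for illustration):
import chardet

def read_first_line(path):
    # Read the raw bytes, guess the encoding, then decode on input.
    with open(path, 'rb') as f:
        raw = f.read()
    guess = chardet.detect(raw)   # e.g. {'encoding': 'UTF-16', 'confidence': 1.0}
    return raw.decode(guess['encoding']).splitlines(True)[0]

line = read_first_line(filename)
print("Test" in line)             # True
# ...and encode on output before writing anywhere, e.g. line.encode('utf-8')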
Upvotes: 5