Xavier C.

Reputation: 1981

How to handle unknown encoding

I'm having some issues with a Python script that needs to open files with different encodings.

I'm usually using this:

with open(path_to_file, 'r') as f:
    first_line = f.readline()

And that works great when the file is properly encoded.

But sometimes it doesn't work; for example, with one particular file I get this:

In [22]: with codecs.open(filename, 'r') as f:
    ...:    a = f.readline()
    ...:    print(a)
    ...:    print(repr(a))
    ...:     
��Test for StackOverlow

'\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'

And I would like to search for text in those lines. Sadly, with that method, I can't:

In [24]: "Test" in a
Out[24]: False
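
For reference, once the bytes are decoded with the right codec, the substring test works. A minimal sketch using the exact byte content from the `repr()` above (Python 3 syntax; note the string in the question is missing the final `\x00` of the UTF-16 newline, which this sketch appends):

```python
# The raw bytes printed above: a UTF-16-LE BOM followed by two-byte characters.
raw = (b'\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a'
       b'\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n\x00')

# Decoding as UTF-16 consumes the BOM and pairs up the bytes,
# so substring searches behave as expected.
text = raw.decode('utf-16')
print("Test" in text)   # True
```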

I've found a lot of questions here referring to the same type of issues:

  1. Unicode (UTF-8) reading and writing to files in Python
  2. UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte
  3. https://softwareengineering.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file
  4. how can i escape '\xff\xfe' to a readable string

But I can't manage to decode the file properly with any of them...

With codecs.open():

In [17]: with codecs.open(filename, 'r', "utf-8") as f:
    a = f.readline()
    print(a)
   ....:     
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-17-0e72208eaac2> in <module>()
      1 with codecs.open(filename, 'r', "utf-8") as f:
----> 2     a = f.readline()
      3     print(a)
      4 

/usr/lib/python2.7/codecs.pyc in readline(self, size)
    688     def readline(self, size=None):
    689 
--> 690         return self.reader.readline(size)
    691 
    692     def readlines(self, sizehint=None):

/usr/lib/python2.7/codecs.pyc in readline(self, size, keepends)
    543         # If size is given, we call read() only once
    544         while True:
--> 545             data = self.read(readsize, firstline=True)
    546             if data:
    547                 # If we're at a "\r" read one extra character (which might

/usr/lib/python2.7/codecs.pyc in read(self, size, chars, firstline)
    490             data = self.bytebuffer + newdata
    491             try:
--> 492                 newchars, decodedbytes = self.decode(data, self.errors)
    493             except UnicodeDecodeError, exc:
    494                 if firstline:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

With encode('utf-8'):

In [18]: with codecs.open(filename, 'r') as f:
    a = f.readline()
    print(a)
   ....:     a.encode('utf-8')
   ....:     print(a)
   ....:     
��Test for StackOverlow

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-18-7facc05b9cb1> in <module>()
      2     a = f.readline()
      3     print(a)
----> 4     a.encode('utf-8')
      5     print(a)
      6 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
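
(The reason an *encode* call raises a *UnicodeDecodeError* is a Python 2 quirk: calling `.encode()` on a byte string first implicitly decodes it with the `ascii` codec. The fix is to decode the bytes with their actual encoding first, then re-encode. A sketch in Python 3 syntax, where the bytes/text distinction is explicit:)

```python
raw = b'\xff\xfeT\x00e\x00s\x00t\x00'   # bytes, not text: BOM + "Test" in UTF-16-LE

# Decode with the real encoding to get text, then encode to whatever you need:
text = raw.decode('utf-16')
utf8_bytes = text.encode('utf-8')
print(text)        # Test
print(utf8_bytes)  # b'Test'
```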

I've found a way to change file encoding automatically with Vim:

system("vim '+set fileencoding=utf-8' '+wq' %s" % path_to_file)

But I would like to do this without using Vim...
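
One pure-Python alternative is to try a short list of candidate encodings until one succeeds. This is only a sketch, not a general detector: the candidate list is an assumption you should adapt to your data, and `latin-1` never fails, so it acts as a last-resort fallback that may silently mis-decode.

```python
def read_first_line(path):
    """Return (first_line, encoding) for the first candidate that decodes.

    utf-8-sig also handles plain UTF-8 (the BOM is optional);
    utf-16 handles BOM-prefixed files like the one in the question.
    """
    for enc in ('utf-8-sig', 'utf-16', 'latin-1'):
        try:
            with open(path, 'r', encoding=enc) as f:
                return f.readline(), enc
        except (UnicodeDecodeError, UnicodeError):
            continue
```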

Any help would be appreciated.

Upvotes: 2

Views: 3085

Answers (2)

Joran Beasley

Reputation: 114038

It looks like this is UTF-16-LE (UTF-16 little endian), but you are missing a final \x00:

>>> s = '\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'
>>> s.decode('utf-16-le') # creates error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\encodings\utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 46: truncated data
>>> (s+"\x00").decode("utf-16-le") # TADA!!!!
u'\ufeffTest for StackOverlow\r\n'
>>>
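
That final `\x00` went missing because the file was read in binary mode: `readline()` split on the raw `b'\n'` byte, cutting the two-byte UTF-16 newline (`\n\x00`) in half. Opening the file with the right encoding avoids the problem entirely; a sketch (Python 3 syntax, using a temporary file to stand in for the file from the question):

```python
import codecs
import os
import tempfile

# Recreate a file like the one in the question: UTF-16-LE BOM + text.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF16_LE + 'Test for StackOverlow\r\n'.encode('utf-16-le'))

# The utf-16 codec consumes the BOM and pairs up the two-byte characters,
# including the newline that a binary-mode readline() cuts in half.
with codecs.open(path, 'r', encoding='utf-16') as f:
    line = f.readline()
os.remove(path)

print(repr(line))   # 'Test for StackOverlow\r\n' -- no stray \x00, no BOM
```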

Upvotes: 6

holdenweb

Reputation: 37113

It looks like you need to detect the encoding in the input file. The chardet library mentioned in the answer to this question might help (though note the proviso that complete encoding detection is not possible).

Then you can write the file out in a known encoding, perhaps. When dealing with Unicode remember that it MUST be encoded into a suitable bytestream before being communicated outside the process. Decode on input, then encode on output.
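
If pulling in `chardet` is not an option, a byte-order-mark check using nothing but the standard library already covers BOM-prefixed files like the one in the question. This is only a sketch under that assumption; it cannot identify BOM-less encodings, which is where `chardet`'s statistical detection earns its keep:

```python
import codecs

# Map each BOM to an encoding name that makes open() consume the BOM.
# Order matters: the UTF-32-LE BOM starts with the UTF-16-LE BOM,
# so the longer UTF-32 marks must be checked first.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8,     'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def sniff_encoding(path, default='utf-8'):
    """Guess the encoding from a leading byte-order mark, if any."""
    with open(path, 'rb') as f:
        head = f.read(4)   # longest BOM is 4 bytes
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding
    return default
```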

Upvotes: 5
