Abhishek Mukherjee

Reputation: 361

Output differences after reading files saved with different encoding options in Python

I have a Unicode string list file, saved with the UTF-8 encoding option. I have another input file, saved as plain ANSI. I read a directory path from that ANSI file, do os.walk(), and try to match whether any file is present in the list (the one saved as UTF-8). But it does not match even when the file is present.

Later I do some simple checks with a single string "40M_Ãz­µ´ú¸ÕÀÉ" and save this particular string (from Notepad) in three different files with the encoding options ANSI, Unicode, and UTF-8. I write a Python script to print:

print repr(string)
print string

And the output is like:

ANSI Encoding

'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'
40M_Ãz­µ´ú¸ÕÀÉ

UNICODE Encoding

'\x004\x000\x00M\x00_\x00\xc3\x00z\x00\xad\x00\xb5\x00\xb4\x00\xfa\x00\xb8\x00\xd5\x00\xc0\x00\xc9\x00'
 4 0 M _ Ã z ­µ ´ ú ¸ Õ À É

UTF-8 Encoding

'40M_\xc3\x83z\xc2\xad\xc2\xb5\xc2\xb4\xc3\xba\xc2\xb8\xc3\x95\xc3\x80\xc3\x89'
40M_Ãz­µ´ú¸ÕÀÉ

I really can't understand how to compare the same string coming from differently encoded files. Please help.

PS: I have some typical Unicode characters like 唐朝小栗子第集.mp3 which are very difficult to handle.

Upvotes: 3

Views: 1191

Answers (1)

bobince

Reputation: 536399

I really can't understand how to compare the same string coming from differently encoded files.

Notepad encoded your character string with three different encodings, resulting in three different byte sequences. To retrieve the character string you must decode those bytes using the same encodings:

>>> ansi_bytes  = '40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'
>>> utf16_bytes = '4\x000\x00M\x00_\x00\xc3\x00z\x00\xad\x00\xb5\x00\xb4\x00\xfa\x00\xb8\x00\xd5\x00\xc0\x00\xc9\x00'
>>> utf8_bytes  = '40M_\xc3\x83z\xc2\xad\xc2\xb5\xc2\xb4\xc3\xba\xc2\xb8\xc3\x95\xc3\x80\xc3\x89'

>>> ansi_bytes.decode('mbcs')
u'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40M_Ãz­µ´ú¸ÕÀÉ
>>> utf16_bytes.decode('utf-16le')
u'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40M_Ãz­µ´ú¸ÕÀÉ
>>> utf8_bytes.decode('utf-8')
u'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40M_Ãz­µ´ú¸ÕÀÉ
  • ‘ANSI’ (not “ASCI”) is what Windows (somewhat misleadingly) calls its default locale-specific code page, which in your case is 1252 (Western European, which you can get in Python as windows-1252) but this will vary from machine to machine. You can get whatever this encoding is from Python on Windows using the name mbcs.

  • ‘Unicode’ is the name Windows uses for the UTF-16LE encoding (very misleadingly, because Unicode is the character set standard and not any kind of bytes⇔characters encoding in itself). Unlike ANSI and UTF-8 this is not an ASCII-compatible encoding, so your attempt to read a line from the file has failed because the line terminator in UTF-16LE is not \n, but \n\x00. This has left a spurious \x00 at the start of the byte string you have above.

  • ‘UTF-8’ is at least accurately named, but Windows likes to put fake Byte Order Marks at the front of its “UTF-8” files that will give you an unwanted u'\uFEFF' character when you decode them. If you want to accept “UTF-8” files saved from Notepad you can manually remove this or use Python's utf-8-sig encoding.
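To make the three cases above concrete, here is a small sketch (written with explicit bytes literals so it runs anywhere; cp1252 stands in for mbcs, which only exists on Windows) showing that all three byte sequences decode back to the same character string, and that utf-8-sig strips Notepad's BOM:

```python
# The text Notepad saved, under three different encoding options.
text = u'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'  # 40M_Ãz­µ´ú¸ÕÀÉ

ansi_bytes  = b'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'
utf16_bytes = (b'4\x000\x00M\x00_\x00\xc3\x00z\x00\xad\x00\xb5\x00'
               b'\xb4\x00\xfa\x00\xb8\x00\xd5\x00\xc0\x00\xc9\x00')
utf8_bytes  = (b'\xef\xbb\xbf'  # Notepad's UTF-8 BOM
               b'40M_\xc3\x83z\xc2\xad\xc2\xb5\xc2\xb4\xc3\xba'
               b'\xc2\xb8\xc3\x95\xc3\x80\xc3\x89')

# cp1252 stands in for 'mbcs' here (Windows-1252 = the poster's ANSI).
assert ansi_bytes.decode('cp1252') == text
assert utf16_bytes.decode('utf-16le') == text
# utf-8-sig removes the BOM; plain utf-8 would leave a stray u'\ufeff'.
assert utf8_bytes.decode('utf-8-sig') == text
assert utf8_bytes.decode('utf-8') == u'\ufeff' + text
```

Once decoded, all three values compare equal, which is the whole point: compare characters, not bytes.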

You can use codecs.open() instead of open() to read a file with automatic Unicode decoding. This also fixes the UTF-16 newline problem, because then the \n characters are detected after decoding instead of before.
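A sketch of that difference (the file name and contents are made up): byte-oriented splitting on \n leaves stray NUL bytes in a UTF-16 file, while codecs.open() decodes first and splits cleanly.

```python
import codecs
import os
import tempfile

# Write two lines the way Notepad's "Unicode" option does: UTF-16 with a BOM.
path = os.path.join(tempfile.mkdtemp(), 'names.txt')
with open(path, 'wb') as f:
    f.write(u'first\r\nsecond\r\n'.encode('utf-16'))

# Naive byte-oriented splitting on b'\n' cuts UTF-16 code units in half,
# leaving stray \x00 bytes on both sides of the split.
with open(path, 'rb') as f:
    raw_lines = f.read().split(b'\n')
assert raw_lines[0].endswith(b'\r\x00')
assert raw_lines[1].startswith(b'\x00')

# codecs.open() decodes first, so line splitting sees real characters.
with codecs.open(path, 'r', 'utf-16') as f:
    lines = [line.strip() for line in f]
assert lines == [u'first', u'second']
```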

I read directory path from that asci file and do os.walk()

Windows filenames are natively handled as Unicode, so when you give Windows a byte string it has to guess what encoding is needed to convert those bytes into characters. It chooses ANSI, not UTF-8. That would be fine if the byte string came from a file that was also encoded in the same machine's ANSI encoding, but then you would be limited to filenames that fit within your machine's locale. In Western European, 40M_Ãz­µ´ú¸ÕÀÉ would fit, but 唐朝小栗子第集.mp3 would not, so you wouldn't be able to refer to Chinese filenames at all.

Python supports passing Unicode filenames directly to Windows, which avoids the problem (most other languages can't do this). Pass a Unicode string into filesystem functions like os.walk() and you should get Unicode strings out, instead of failure.
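A minimal sketch of that behaviour (the file name here is invented, and the temp directory stands in for a real directory path): passing a Unicode path into os.walk() gives back Unicode names that can be compared directly.

```python
import os
import tempfile

# Create a file whose name contains a non-ASCII character.
root = tempfile.mkdtemp()
open(os.path.join(root, u'40M_\xc3z.txt'), 'w').close()

# Because the path passed in is a Unicode string, the names that come
# back out of os.walk() are Unicode strings too (on Python 2; on
# Python 3 all str paths are already Unicode).
found = []
for dirpath, dirnames, filenames in os.walk(root):
    found.extend(filenames)

assert found == [u'40M_\xc3z.txt']
```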

So, for UTF-8-encoded input files, something like:

import codecs
import os

with codecs.open(u'directory_path.txt', 'rb', 'utf-8-sig') as fp:
    directory_path = fp.readline().strip(u'\r\n') # unicode dir path

good_names = set()
with codecs.open(u'filename_list.txt', 'rb', 'utf-8-sig') as fp:
    for line in fp:
        good_names.add(line.strip(u'\r\n')) # set of unicode file names

for dirpath, dirnames, filenames in os.walk(directory_path): # names will be unicode strings
    for filename in filenames:
        if filename in good_names:
            pass # do something with the file

Upvotes: 3
