Pruthvi Raj

Reputation: 3036

python magic can't identify unicode filename

In my small project I had to identify the types of the files in a directory, so I went with the python-magic module and did the following:

import os
import magic
from Tkinter import Tk
from tkFileDialog import askdirectory

def getDirInput():
    # Hide the empty Tk root window and show only the directory chooser
    root = Tk()
    root.withdraw()
    return askdirectory()

di = getDirInput()
print('Selected Directory: ' + di)
for f in os.listdir(di):
    m = magic.Magic(magic_file='magic')
    print 'Type of ' + f + '  -->  ' + m.from_file(f)

But it seems that python-magic can't take unicode filenames as they are when I pass them to the from_file() function. Here's a sample output:

Selected Directory: C:/Users/pruthvi/Desktop/vidrec/temp
Type of log.txt  -->  ASCII English text, with very long lines, with CRLF, CR line terminators
Type of TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4  -->  cannot open `TAEYEON \355\234\227_ I (feat. Verbal Jint)_Music Video.mp4' (No such file or directory)
Type of test.py  -->  a python script text executable

You can see that python-magic failed to identify the type of the second file, TAEYEON..., because its name contains Unicode characters. It shows the 태연 characters as \355\234\227, even though I passed the filename the same way in both cases. How can I overcome this problem and find the type of files with Unicode characters in their names as well? Thank you.

Upvotes: 5

Views: 2246

Answers (2)

bobince

Reputation: 536359

But it seems that python-magic can't take unicode filenames

Correct. In fact most cross-platform software on Windows can't handle non-ASCII characters in filenames.

This is because the C standard library uses byte strings for all filenames but Windows uses Unicode strings (technically, UTF-16 code unit strings, but the difference isn't important here). When software using the C standard library opens a file by byte-based string, the MS C runtime converts that to a Unicode string automatically, using an encoding (the confusingly-named ‘ANSI’ code page) that depends on the locale of the Windows installation. Your ANSI code page is probably 1252, which can't encode Korean characters, so it's impossible to use that filename. The ANSI code page is unfortunately never anything sensible like UTF-8, so it can never include all possible Unicode characters.
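To make the code page limitation concrete, here is a small sketch that checks whether a filename survives encoding to a given code page. The helper name encodable and the shortened filename are made up for illustration, and code page 1252 is only my guess at your ANSI code page. It runs on both Python 2 and Python 3.

```python
# Hypothetical, shortened filename; \ud0dc\uc5f0 are the two Korean syllables.
name = u"TAEYEON \ud0dc\uc5f0.mp4"


def encodable(text, codec):
    """Return True if every character of `text` can be encoded with `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Code page 1252 (Western European) has no Korean characters at all.
in_cp1252 = encodable(name, "cp1252")   # False
# Code page 949 (Korean) and UTF-8 can both represent the name.
in_cp949 = encodable(name, "cp949")     # True
in_utf8 = encodable(name, "utf-8")      # True

# A lossy conversion replaces each unencodable character with '?',
# so the resulting byte string no longer names the real file.
lossy = name.encode("cp1252", "replace")  # b'TAEYEON ??.mp4'
```

Any byte-string route to that filename under such a code page therefore points at a file that doesn't exist.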

Python is special in that it has extra support for Windows Unicode filenames, which bypasses the C standard library and calls the underlying Win32 APIs directly for Unicode filenames. So you can pass a unicode string to e.g. open() and it will work for all filenames.

However python-magic's from_file call doesn't open the file from Python. Instead it passes the filename to the libmagic library which is written in pure C. libmagic doesn't have the special Windows-filename code path for Unicode so this fails.

I suggest opening the file yourself from Python and using magic.from_buffer instead.
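A minimal sketch of that approach, with some assumptions: read_header, identify_dir and magic_identify are names I made up, and the 2048-byte header size is just a commonly suggested amount for libmagic to work with, not a hard requirement. The identify argument is any callable that takes bytes, so python-magic is only imported if you actually call magic_identify.

```python
import io
import os


def read_header(path, size=2048):
    """Open `path` from Python (Unicode-aware on Windows) and return its first bytes."""
    with io.open(path, "rb") as fh:
        return fh.read(size)


def identify_dir(directory, identify, size=2048):
    """Map each regular file name in `directory` to identify(<first `size` bytes>)."""
    results = {}
    for name in os.listdir(directory):
        # Join with the directory so the lookup doesn't depend on the current working dir.
        full = os.path.join(directory, name)
        if os.path.isfile(full):
            results[name] = identify(read_header(full, size))
    return results


def magic_identify(data):
    """Assumed wiring to python-magic; imported lazily so the helpers above work without it."""
    import magic
    return magic.from_buffer(data)
```

With the question's variables this would be called as identify_dir(di, magic_identify), or with m.from_buffer in place of magic_identify if you want to keep your own magic.Magic(magic_file='magic') instance. On Python 2, pass di as a unicode string so that os.listdir returns unicode names.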

Upvotes: 7

Alastair McCormack

Reputation: 27704

The response from the magic module seems to show that your characters were incorrectly translated somewhere: only half the string is shown and the byte order is wrong. It should be \355\227\234 at least.

As this is on Windows, this raises UTF-16 byte-order alarm bells.

It might be possible to work around this by encoding to UTF-16. As suggested by other commenters, you need to prefix the directory.

import locale
import os
import magic

input_encoding = locale.getpreferredencoding()
u_di = di.decode(input_encoding)  # di is the directory chosen in the question's code
m = magic.Magic(magic_file='magic') # only needs to be initialised once

for f in os.listdir(u_di):
    fq_f = os.path.join(u_di, f)  # prefix the directory to get the full path
    utf16_fq_f = fq_f.encode("UTF-16LE")
    print m.from_file(utf16_fq_f)

Upvotes: 2
