Reputation: 3036
In my small project I had to identify the types of the files in a directory, so I went with the python-magic module and did the following:
import os
import magic
from Tkinter import Tk
from tkFileDialog import askdirectory

def getDirInput():
    # Show a directory picker without displaying the empty Tk root window
    root = Tk()
    root.withdraw()
    return askdirectory()

di = getDirInput()
print('Selected Directory: ' + di)
for f in os.listdir(di):
    m = magic.Magic(magic_file='magic')
    print 'Type of ' + f + ' --> ' + m.from_file(f)
But it seems that python-magic can't take Unicode filenames as they are when I pass them to the from_file() function. Here's a sample output:
Selected Directory: C:/Users/pruthvi/Desktop/vidrec/temp
Type of log.txt --> ASCII English text, with very long lines, with CRLF, CR line terminators
Type of TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4 --> cannot open `TAEYEON \355\234\227_ I (feat. Verbal Jint)_Music Video.mp4' (No such file or directory)
Type of test.py --> a python script text executable
You can observe that python-magic failed to identify the type of the second file, TAEYEON..., as it had Unicode characters in its name. It shows the 태연 characters as \355\234\227 instead, even though I passed the filename the same way in both cases. How can I overcome this problem and find the type of files whose names contain Unicode characters as well? Thank you.
Upvotes: 5
Views: 2246
Reputation: 536359
But It seems that python-magic can't take unicode filenames
Correct. In fact most cross-platform software on Windows can't handle non-ASCII characters in filenames.
This is because the C standard library uses byte strings for all filenames but Windows uses Unicode strings (technically, UTF-16 code unit strings, but the difference isn't important here). When software using the C standard library opens a file by byte-based string, the MS C runtime converts that to a Unicode string automatically, using an encoding (the confusingly-named ‘ANSI’ code page) that depends on the locale of the Windows installation. Your ANSI code page is probably 1252, which can't encode Korean characters, so it's impossible to use that filename. The ANSI code page is unfortunately not, by default, anything sensible like UTF-8 (recent Windows 10 versions offer UTF-8 only as an opt-in setting), so in general it cannot include all possible Unicode characters.
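To see the code page limitation concretely, one can ask Python whether a given filename can be encoded in a given code page at all. The following is only a sketch: the helper name encodable is made up for illustration, and cp1252 is the guess from the paragraph above, not a value read from your system.

```python
# -*- coding: utf-8 -*-

def encodable(name, codepage):
    """Return True if the unicode string `name` can be represented in `codepage`."""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

korean = u'\ud0dc\uc5f0'  # the two Hangul syllables from the filename in the question

print(encodable(u'log.txt', 'cp1252'))  # plain ASCII fits in any ANSI code page
print(encodable(korean, 'cp1252'))      # Western European code page
print(encodable(korean, 'cp949'))       # Korean code page
```

A byte-string API can only ever reach names for which this check succeeds in the machine's ANSI code page.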
Python is special in that it has extra support for Windows Unicode filenames which bypasses the C standard library and calls the underlying Win32 APIs directly for Unicode filenames. So you can pass a unicode string to, e.g., open() and it will work for all filenames.
However, python-magic's from_file call doesn't open the file from Python. Instead it passes the filename to the libmagic library, which is written in pure C. libmagic doesn't have the special Windows-filename code path for Unicode, so this fails.
I suggest opening the file yourself from Python and using magic.from_buffer instead.
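A minimal sketch of that approach follows. The helper names read_header, identify_directory and magic_identify are made up for illustration, and the 2048-byte header size is an assumption based on the python-magic README's advice to pass at least that much to from_buffer.

```python
import io
import os

def read_header(path, size=2048):
    """Read the first `size` bytes of `path`.

    Python opens the file itself, so a unicode path works even when the
    name cannot be expressed in the ANSI code page.
    """
    with io.open(path, 'rb') as fh:
        return fh.read(size)

def identify_directory(directory, identify):
    """Map each regular file in `directory` to identify(first bytes of the file)."""
    results = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            results[name] = identify(read_header(path))
    return results

def magic_identify(data):
    """Describe a byte buffer with python-magic (imported lazily)."""
    import magic
    return magic.from_buffer(data)
```

With the question's variables this would be called as identify_directory(di, magic_identify). On Python 2, pass the directory as a unicode string so that os.listdir returns unicode names. Printing such names to the Windows console can still fail, but that is a separate issue from identifying the file.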
Upvotes: 7
Reputation: 27704
The response from the magic module seems to show that your characters were incorrectly translated somewhere: only half the string is shown, and the bytes shown for 태 are wrong. Its UTF-8 encoding is \355\203\234, not \355\234\227.
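To check which bytes a correctly encoded name would contain, one can print its UTF-8 encoding in the same octal-escape notation that appears in the error message. A small sketch (the helper name utf8_octal_escapes is made up for illustration):

```python
def utf8_octal_escapes(text):
    """Return the UTF-8 bytes of the unicode string `text` as octal escapes."""
    # bytearray yields integers on both Python 2 and Python 3
    return ''.join('\\%03o' % byte for byte in bytearray(text.encode('utf-8')))

# U+D0DC is the first Hangul syllable of the filename in the question
print(utf8_octal_escapes(u'\ud0dc'))
```

Comparing this output with the escapes in the error message shows whether the name reached libmagic intact.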
As this is on Windows, this raises UTF-16 byte-order alarm bells.
It might be possible to work around this by encoding to UTF-16. As suggested by other commenters, you also need to prefix each filename with the directory.
import locale
import os
import magic

input_encoding = locale.getpreferredencoding()
u_di = di.decode(input_encoding)  # di is the directory selected in the question's code
m = magic.Magic(magic_file='magic')  # only needs to be initialised once
for f in os.listdir(u_di):
    fq_f = os.path.join(u_di, f)
    utf16_fq_f = fq_f.encode("UTF-16LE")
    print m.from_file(utf16_fq_f)
Upvotes: 2