How can Python check if a file name is in UTF8?

Question

I have a PHP script that lists the files in a directory. However, PHP sees only file names in English and completely ignores file names in other languages, such as Russian or Asian languages.

After a lot of effort, I found the only solution that works for me: a Python script that renames the files to UTF-8 so that the PHP script can process them afterwards.

(After PHP has finished processing the files, I rename them to English; I don't keep them in UTF-8.)

I used the following Python script, which works fine:

import sys
import os
import glob
import ntpath
from random import randint

for infile in glob.glob( os.path.join('C:\MyFiles', u'*') ):
    if os.path.isfile(infile):
      infile_utf8 = infile.encode('utf8')
      os.rename(infile, infile_utf8)

The problem is that it also converts file names that are already in UTF-8. I need a way to skip the conversion when the file name is already in UTF-8.

I tried this Python script:

for infile in glob.glob( os.path.join('C:\MyFiles', u'*') ):
    if os.path.isfile(infile):
      try:
        infile.decode('UTF-8', 'strict')
      except UnicodeDecodeError:
        infile_utf8 = infile.encode('utf8')
        os.rename(infile, infile_utf8)

But if the file name is already in UTF-8, I get a fatal error:

UnicodeDecodeError: 'ascii' codec can't decode characters in position 18-20
ordinal not in range(128)

I also tried another way, which also didn't work:

for infile in glob.glob( os.path.join('C:\MyFiles', u'*') ):
    if os.path.isfile(infile):
      try:
        tmpstr = str(infile)
      except UnicodeDecodeError:
        infile_utf8 = infile.encode('utf8')
        os.rename(infile, infile_utf8)

I got exactly the same error as before.

Any ideas?

Python is very new to me, and debugging even a simple script is a huge effort, so please write an explicit answer (i.e. code). I'm not able to test general ideas that may or may not work. Thanks.

Examples of file names:

 hello.txt
 你好.txt
 안녕하세요.html
 chào.doc

Alastair McCormack · Accepted Answer

I think you're confusing your terminology and making some wrong assumptions. AFAIK, PHP can open filenames of any encoding type - PHP is very much agnostic about encoding types.

You haven't been clear exactly what you want to achieve as UTF-8 != English and the example foreign filenames could be encoded in a number of ways but never in ASCII English! Can you explain what you think an existing UTF-8 file looks like and what a non-UTF-8 file is?

To add to your confusion, under Windows, filenames are transparently stored as UTF-16. Therefore, you should not try to encode filenames to UTF-8. Instead, you should use Unicode strings and allow Python to work out the proper conversion. (Don't encode in UTF-16 either!)
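As a minimal sketch of that advice, assuming Python 3 (where str is already Unicode, unlike the Python 2 scripts in the question) and a hypothetical helper name list_file_names, a directory can be listed with no manual encoding step at all. Passing a text path to os.listdir yields text names, and Python handles the conversion to whatever the filesystem uses natively:

```python
import os


def list_file_names(directory):
    """Return the names of regular files in `directory` as Unicode strings.

    `list_file_names` is a hypothetical helper, not part of the original answer.
    """
    names = []
    for name in os.listdir(directory):  # str path in, str names out
        if os.path.isfile(os.path.join(directory, name)):
            names.append(name)
    return sorted(names)


if __name__ == "__main__":
    # ascii() keeps the output printable on consoles with a limited code page.
    for name in list_file_names("."):
        print(ascii(name))
```

No call to encode or decode appears anywhere: the names stay Unicode from the directory listing through to any later rename or copy.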

Please clarify your question further.

Update:

I now understand your problem with PHP. http://evertpot.com/filesystem-encoding-and-php/ tells us that non-Latin characters are troublesome with PHP on Windows. It would seem that only files whose names consist entirely of Windows-1252 characters can be seen and opened.
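To make that criterion concrete, here is a small sketch, assuming Python 3 and a hypothetical function name is_cp1252_safe, that tests whether a filename consists only of Windows-1252 characters. It simply attempts the encoding and reports whether it succeeded:

```python
def is_cp1252_safe(name):
    """Return True if every character in `name` exists in Windows-1252.

    `is_cp1252_safe` is a hypothetical helper name used for illustration.
    """
    try:
        name.encode("cp1252")  # strict error handling is the default
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # The example names from the question.
    for name in ["hello.txt", "你好.txt", "안녕하세요.html", "chào.doc"]:
        print(ascii(name), is_cp1252_safe(name))
```

Of the example names in the question, hello.txt and chào.doc pass this check (à is a Windows-1252 character), while the Chinese and Korean names do not.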

The challenge you have is to convert your filenames to be Windows-1252 compatible. As you've stated in your question, it would be ideal not to rename files that are already compatible. I've reworked your attempt as:

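import os
from glob import glob
import shutil
import urllib

# Destination directory for the copies; create it if it is missing.
if not os.path.isdir("fixed"):
    os.mkdir("fixed")

# A Unicode pattern makes glob return Unicode filenames (Python 2).
files = glob(u'*.txt')
for my_file in files:
    # The console may not be able to display the name, so fall back to escapes.
    try:
        print "File %s" % my_file
    except UnicodeEncodeError:
        print "File (escaped): %s" % my_file.encode("unicode_escape")
    new_name = my_file
    try:
        # Raises UnicodeEncodeError if any character is outside Windows-1252.
        my_file.encode("cp1252", "strict")
        print "    Name unchanged. Copying anyway"
    except UnicodeEncodeError:
        print "    Can not convert to cp1252"
        utf_8_name = my_file.encode("UTF-8")
        new_name = urllib.quote(utf_8_name)
        print "    New name: (%% encoded): %s" % new_name

    shutil.copy2(my_file, os.path.join("fixed", new_name))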

Breakdown:

1. Print the filename. By default, the Windows shell only shows results in a local DOS code page. For example, my shell can show ü.txt but €.txt shows as ?.txt. Therefore, you need to be careful of Python throwing exceptions because it can't print properly. This code attempts to print the Unicode version but falls back to printing Unicode code point escapes instead.
2. Try to encode the string as Windows-1252. If this works, the filename is OK.
3. Otherwise, convert the filename to UTF-8, then percent-encode it. This way, the filename remains unique and you could reverse the procedure in PHP.
4. Copy the file under its new/verified name.

For example, 你好.txt becomes %E4%BD%A0%E5%A5%BD.txt
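The percent-encoding step and its reversal can be sketched as follows. This is an illustration only, assuming Python 3, where urllib.parse.quote replaces the Python 2 urllib.quote call and encodes text as UTF-8 by default. The function names percent_encode_name and percent_decode_name are hypothetical:

```python
from urllib.parse import quote, unquote


def percent_encode_name(name):
    """Percent-encode the UTF-8 bytes of `name` (hypothetical helper)."""
    # safe="" means no extra characters are exempt; letters, digits and
    # "_.-~" are never quoted, so the extension stays readable.
    return quote(name, safe="")


def percent_decode_name(encoded):
    """Reverse percent_encode_name, decoding the bytes as UTF-8."""
    return unquote(encoded)


if __name__ == "__main__":
    encoded = percent_encode_name("你好.txt")
    print(encoded)
    # ascii() avoids console code page problems when showing the original.
    print(ascii(percent_decode_name(encoded)))
```

Decoding should only be applied to names that were actually encoded; a name that was left unchanged but happens to contain a literal % would otherwise be altered.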
