Reputation: 53
I have a PHP script that lists the files in a directory. However, PHP only sees file names in English and completely ignores file names in other languages, such as Russian or Asian languages.
After a lot of effort, I found the only solution that works for me: a Python script that renames the files to UTF-8, so that the PHP script can process them afterwards.
(After PHP has finished processing the files, I rename them to English; I don't keep them in UTF-8.)
I used the following Python script, which works fine:
import sys
import os
import glob
import ntpath
from random import randint
for infile in glob.glob( os.path.join('C:\\MyFiles', u'*') ):
    if os.path.isfile(infile):
        infile_utf8 = infile.encode('utf8')
        os.rename(infile, infile_utf8)
The problem is that it also converts file names that are already in UTF-8. I need a way to skip the conversion when the file name is already in UTF-8.
I tried this Python script:
for infile in glob.glob( os.path.join('C:\\MyFiles', u'*') ):
    if os.path.isfile(infile):
        try:
            infile.decode('UTF-8', 'strict')
        except UnicodeDecodeError:
            infile_utf8 = infile.encode('utf8')
            os.rename(infile, infile_utf8)
But if the file name is already in UTF-8, I get a fatal error:
UnicodeDecodeError: 'ascii' codec can't decode characters in position 18-20
ordinal not in range(128)
I also tried another way, which also didn't work:
for infile in glob.glob( os.path.join('C:\\MyFiles', u'*') ):
    if os.path.isfile(infile):
        try:
            tmpstr = str(infile)
        except UnicodeDecodeError:
            infile_utf8 = infile.encode('utf8')
            os.rename(infile, infile_utf8)
I got exactly the same error as before.
Any ideas?
Python is very new to me, and debugging even a simple script is a huge effort, so please write an explicit answer (i.e. code). I am not able to test general ideas that may or may not work. Thanks.
Examples of file names:
hello.txt
你好.txt
안녕하세요.html
chào.doc
Upvotes: 5
Views: 12873
Reputation: 27744
I think you're confusing your terminology and making some wrong assumptions. AFAIK, PHP can open filenames of any encoding type - PHP is very much agnostic about encoding types.
You haven't been clear exactly what you want to achieve as UTF-8 != English and the example foreign filenames could be encoded in a number of ways but never in ASCII English! Can you explain what you think an existing UTF-8 file looks like and what a non-UTF-8 file is?
To add to your confusion, under Windows, filenames are transparently stored as UTF-16. Therefore, you should not try to encode filenames to UTF-8. Instead, you should use Unicode strings and allow Python to work out the proper conversion. (Don't encode in UTF-16 either!)
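As a minimal sketch of "use Unicode strings and let Python do the conversion": the snippet below is written for Python 3 (the code in this thread is Python 2), where passing a text path to `os.listdir` returns text names with no manual encoding step. The helper name `list_file_names` is hypothetical and only for illustration.

```python
import os
import tempfile


def list_file_names(directory):
    """Return the regular files in `directory` as text (str) names.

    Hypothetical helper: a str path goes in and str names come out,
    so no explicit encode/decode call is needed.
    """
    names = []
    for entry in os.listdir(directory):
        if os.path.isfile(os.path.join(directory, entry)):
            names.append(entry)
    return sorted(names)


if __name__ == "__main__":
    # Demonstrate on a throwaway directory instead of a real one.
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "hello.txt"), "w").close()
        print(list_file_names(tmp))
```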
Please clarify your question further.
Update:
I now understand your problem with PHP. http://evertpot.com/filesystem-encoding-and-php/ tells us that non-Latin characters are troublesome with PHP+Windows. It would seem that only files whose names consist entirely of characters in the Windows-1252 character set can be seen and opened.
The challenge you have is to convert your filenames to be Windows 1252 compatible. As you've stated in your question, it would be ideal not to rename files that are already compatible. I've reworked your attempt as:
import os
from glob import glob
import shutil
import urllib

files = glob(u'*.txt')
for my_file in files:
    try:
        print "File %s" % my_file
    except UnicodeEncodeError:
        print "File (escaped): %s" % my_file.encode("unicode_escape")
    new_name = my_file
    try:
        my_file.encode("cp1252" , "strict")
        print " Name unchanged. Copying anyway"
    except UnicodeEncodeError:
        print " Can not convert to cp1252"
        utf_8_name = my_file.encode("UTF-8")
        new_name = urllib.quote(utf_8_name )
        print " New name: (%% encoded): %s" % new_name
    shutil.copy2(my_file, os.path.join("fixed", new_name))
Breakdown:
Print the filename. By default, the Windows shell only shows results in a local DOS code page. For example, my shell can show ü.txt but €.txt shows as ?.txt. Therefore, you need to be careful of Python throwing exceptions because it can't print properly. This code attempts to print the Unicode version but falls back to printing Unicode code point escapes instead.
Try to encode the string as Windows-1252. If this works, the filename is OK.
Else: convert the filename to UTF-8, then percent-encode it. This way, the filename remains unique and you could reverse this procedure in PHP.
Copy the file to the new/verified name in the "fixed" directory.
For example, 你好.txt becomes %E4%BD%A0%E5%A5%BD.txt
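The renaming rule can also be sketched on its own, separately from the file copying. This is a Python 3 version (the script above is Python 2, where the function lives at `urllib.quote`); the helper names `php_safe_name` and `original_name` are hypothetical. It assumes, as in the answer, that "safe" means "encodable as cp1252".

```python
from urllib.parse import quote, unquote


def php_safe_name(name):
    """Return `name` unchanged if it fits cp1252, else percent-encode it.

    Hypothetical helper mirroring the steps above: try cp1252 first,
    fall back to UTF-8 bytes that are then percent-encoded.
    """
    try:
        name.encode("cp1252", "strict")
        return name
    except UnicodeEncodeError:
        return quote(name.encode("utf-8"))


def original_name(safe_name):
    """Reverse php_safe_name for names that were percent-encoded."""
    return unquote(safe_name, encoding="utf-8")


if __name__ == "__main__":
    for example in ["hello.txt", "你好.txt", "안녕하세요.html", "chào.doc"]:
        print(example, "->", php_safe_name(example))
```

One caveat with this sketch: a name that already fits cp1252 but contains a literal `%` followed by two hex digits is left unchanged by `php_safe_name`, so decoding it afterwards would alter it. If such names can occur, they would need to be percent-encoded as well.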
Upvotes: 4
Reputation: 433
For all UTF-8 issues with Python, I warmly recommend spending 36 minutes watching "Pragmatic Unicode" by Ned Batchelder (http://nedbatchelder.com/text/unipain.html), presented at PyCon 2012. For me it was a revelation! Much of this presentation is not Python-specific, but it helps in understanding important things like the difference between Unicode strings and UTF-8 encoded bytes...
The reason I'm recommending this video to you (as I have to many friends) is that some of your code contains contradictions, like trying to decode and then encode if decoding fails: such methods cannot apply to the same object! Even though this is syntactically possible in Python 2, it makes no sense, and in Python 3 the distinction between bytes and str makes things clearer:
A str object can be encoded to bytes:
>>> a = 'a'
>>> type(a)
<class 'str'>
>>> a.encode
<built-in method encode of str object at 0x7f1f6b842c00>
>>> a.decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
...while a bytes object can be decoded to str:
>>> b = b'b'
>>> type(b)
<class 'bytes'>
>>> b.decode
<built-in method decode of bytes object at 0x7f1f6b79ddc8>
>>> b.encode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
Coming back to your question of working with filenames, the tricky question you need to answer is: "What is the encoding of your filenames?" The language doesn't matter, only the encoding!
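To illustrate why the encoding matters rather than the language, here is a small Python 3 sketch. The same text name produces different bytes under UTF-8 and cp1252, and only raw bytes can be checked for "is this valid UTF-8". The helper name `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(raw):
    """Return True if the bytes object `raw` decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8", "strict")
        return True
    except UnicodeDecodeError:
        return False


# One text name, two different byte representations.
name = "chào.doc"
utf8_bytes = name.encode("utf-8")      # "à" becomes two bytes: C3 A0
cp1252_bytes = name.encode("cp1252")   # "à" becomes one byte: E0

if __name__ == "__main__":
    print(utf8_bytes, is_valid_utf8(utf8_bytes))
    print(cp1252_bytes, is_valid_utf8(cp1252_bytes))
```

Note that the check only makes sense on bytes. A str (a Python 2 `unicode` object) is already decoded text, so asking whether it "is UTF-8" has no answer.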
Upvotes: 3