Reputation: 1429
I have a file that has a Unicode name, say 'קובץ.txt'. I want to pack it, and I'm using Python's zipfile.
I can zip the files and open them later on without a problem, except that the file names are messed up when I view the archive in Windows 7 File Explorer (7zip works great).
According to the docs, this is a common problem, and there are instructions on how to deal with that:
From ZipFile.write
Note
There is no official file name encoding for ZIP files. If you have unicode file names, you must convert them to byte strings in your desired encoding before passing them to write(). WinZip interprets all file names as encoded in CP437, also known as DOS Latin.
Sorry, but I can't work out what exactly I'm supposed to do with the filename. I've tried .encode('CP437'), .decode('CP437')...
Upvotes: 1
Views: 7590
Reputation: 1
For CP866 (Russian) this works:
from zipfile import ZipFile, ZipInfo

class ZipInf(ZipInfo):
    def __init__(self, filename):
        super().__init__(filename)
        self.create_system = 0  # 0 = MS-DOS/FAT

    def _encodeFilenameFlags(self):
        # Store the name as CP866 bytes and leave the flag bits unchanged
        return self.filename.encode('cp866'), self.flag_bits

with ZipFile('ex.zip', 'w') as zipf:
    zipf.writestr(ZipInf('Файл'), '123456789'*1024)
This stores directory and file names in the zip as CP866 bytes (the example contains only the file 'Файл').
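To read such an archive back with Python, the name has to be decoded with the same legacy encoding. A minimal sketch, assuming the archive was written without the UTF-8 flag: when that flag is absent, zipfile decodes the stored bytes as CP437, so the original name can be recovered by reversing that step. The helper name `real_name` is hypothetical, and the `ZipInfo` below only simulates what zipfile would report for a CP866 name.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose flag bit 11: name is stored as UTF-8


def real_name(info, legacy_encoding):
    """Return the intended name for a ZipInfo that zipfile has read.

    Hypothetical helper; legacy_encoding is whatever the writer used.
    """
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # zipfile already decoded it as UTF-8
    # Without the flag, zipfile decodes the raw bytes as CP437.
    # Undo that step, then decode with the encoding the writer used.
    return info.filename.encode('cp437').decode(legacy_encoding)


# Simulate what zipfile reports for a CP866 name stored without the UTF-8 flag.
as_read = zipfile.ZipInfo('Файл'.encode('cp866').decode('cp437'))
recovered = real_name(as_read, 'cp866')
print(recovered)
```

On Python 3.11 and later, passing `metadata_encoding='cp866'` to `ZipFile` when reading should make this manual step unnecessary.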
Upvotes: 0
Reputation: 1635
Try this:
import zipfile
p=b'\xd7\xa7\xd7\x95\xd7\x91\xd7\xa5.txt'.decode('utf8')
# or just:
# p='קובץ.txt'
z=zipfile.ZipFile('test.zip','w')
f=z.open(p.encode('utf8').decode('cp437'),'w')
f.write(b'hello world')
f.close()
z.close()
I tested this on Mac OS X with utf8 in place of cp437 above, and it works.
I hope the cp437 version works on Windows.
I've also tested reading Chinese filenames with the "gbk" or "gb18030" encoding using similar code, and it works well.
When the zip archive comes from (or needs to be sent to) Mac/Linux, change cp437 in the code to utf8 and everything works.
When the zip archive comes from (or needs to be sent to) Windows, leave cp437 unchanged.
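It may be worth checking what actually ends up in the archive. As far as I can tell, Python 3's zipfile stores a non-ASCII str name as UTF-8 and sets general-purpose flag bit 11, whatever the platform, so transforming the str beforehand changes the name that gets stored rather than the encoding. A small sketch that writes to an in-memory archive and inspects the result (the variable names are only illustrative):

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose flag bit 11: name is stored as UTF-8

name = 'קובץ.txt'
buf = io.BytesIO()  # in-memory archive, so nothing is written to disk

with zipfile.ZipFile(buf, 'w') as z:
    z.writestr(name, b'hello world')

with zipfile.ZipFile(buf) as z:
    info = z.infolist()[0]
    stored_as_utf8 = bool(info.flag_bits & UTF8_FLAG)
    round_tripped = info.filename

print(stored_as_utf8, round_tripped == name)
```

Whether a given Windows tool honours that flag is a separate question that this check cannot answer.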
Upvotes: 0
Reputation: 1123440
You'd have to encode your Unicode string to CP437. However, you can't encode your specific example because the CP437 codec does not support Hebrew:
>>> u'קובץ.txt'.encode('cp437')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
The above error tells you that the first 4 characters (קובץ) cannot be encoded because there are no such characters in the target character set. CP437 only supports the Western alphabet (A-Z, and accented characters like ç and é), IBM line-drawing characters (such as ╚ and ┤) and a few Greek symbols, mainly for math equations (such as Σ and φ).
You'll either have to generate a different filename that only uses characters supported by the CP437 codec, or live with the fact that WinZip will never be able to show Hebrew filenames properly and simply stick with the character set that did work for you with 7zip.
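If you go the first route, you can test a name before writing it. A minimal sketch; `fits_cp437` and `cp437_fallback` are hypothetical helper names, and replacing unsupported characters with an underscore is just one possible policy:

```python
def fits_cp437(name):
    """Return True if every character of name exists in CP437."""
    try:
        name.encode('cp437')
    except UnicodeEncodeError:
        return False
    return True


def cp437_fallback(name, placeholder='_'):
    """Replace characters CP437 cannot represent with a placeholder."""
    return ''.join(ch if fits_cp437(ch) else placeholder for ch in name)


print(fits_cp437('café.txt'))      # True: é is part of CP437
print(fits_cp437('קובץ.txt'))      # False: no Hebrew letters in CP437
print(cp437_fallback('קובץ.txt'))  # ____.txt
```

Note that the fallback loses information, so two different Hebrew names of the same length would collide.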
Upvotes: 8