Reputation: 404
Hello, I am currently working on a tool that has to extract some .tar files.
It works well for the most part, but I have one problem:
Some .tar and .zip files contain members whose names include characters that are illegal on Windows (e.g. ":"). The program has to run on Windows machines, so I have to deal with this.
Is there a way to rename files in the extracted output if their names contain a ":" or another character that is illegal on Windows?
My current implementation:
import tarfile
import zipfile

def read_zip(filepath, extractpath):
    with zipfile.ZipFile(filepath, 'r') as zfile:
        contains_bad_char = False
        for finfo in zfile.infolist():
            if ":" in finfo.filename:
                contains_bad_char = True
        # Skip the whole archive if any member name contains ":"
        if not contains_bad_char:
            zfile.extractall(path=extractpath)

def read_tar(filepath, extractpath):
    with tarfile.open(filepath, "r:gz") as tar:
        contains_bad_char = False
        for member in tar.getmembers():
            if ":" in member.name:
                contains_bad_char = True
        # Skip the whole archive if any member name contains ":"
        if not contains_bad_char:
            tar.extractall(path=extractpath)
So currently I am just skipping these archives altogether, which is not ideal.
To describe better what I am asking for I can provide a small example:
file_with_files.tar -> small_file_1.txt
                    -> small_file_2.txt
                    -> annoying:file_1.txt
                    -> annoying:file_1.txt
Should extract to
file_with_files -> small_file_1.txt
                -> small_file_2.txt
                -> annoying_file_1.txt
                -> annoying_file_1.txt
Is the only solution to iterate over every file object in the compressed file and extract them one by one, or is there a more elegant solution?
Upvotes: 1
Views: 2537
Reputation: 41116
According to [Python.Docs]: ZipFile.extract(member, path=None, pwd=None):
On Windows illegal characters (:, <, >, |, ", ?, and *) replaced by underscore (_).
So, things are already taken care of:
>>> import os
>>> import zipfile
>>>
>>> os.getcwd()
'e:\\Work\\Dev\\StackOverflow\\q055340013'
>>> os.listdir()
['arch.zip']
>>>
>>> zf = zipfile.ZipFile("arch.zip")
>>> zf.namelist()
['file0.txt', 'file:1.txt']
>>> zf.extractall()
>>> zf.close()
>>>
>>> os.listdir()
['arch.zip', 'file0.txt', 'file_1.txt']
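The docs describe this substitution for Windows only. If the same renaming is wanted explicitly (for example, to preview the resulting names on another platform), a rough sketch is to copy the members out one at a time. The helper names sanitize_name and extract_zip_sanitized are made up for this illustration, and the character set is simply the one quoted from the docs above:

```python
import os
import shutil
import zipfile

# Characters quoted in the ZipFile.extract documentation as illegal on Windows.
WIN_ILLEGAL = ':<>|"?*'
_TABLE = str.maketrans(WIN_ILLEGAL, "_" * len(WIN_ILLEGAL))


def sanitize_name(name):
    """Replace every character from WIN_ILLEGAL with an underscore."""
    return name.translate(_TABLE)


def extract_zip_sanitized(filepath, extractpath="."):
    """Extract all members one by one, writing each under a sanitized name.

    Hypothetical helper (not part of zipfile). Returns the written file paths.
    """
    written = []
    with zipfile.ZipFile(filepath) as zf:
        for info in zf.infolist():
            # Zip member names use "/" as separator; drop empty, "." and ".."
            # components so nothing is written outside extractpath.
            parts = [p for p in sanitize_name(info.filename).split("/")
                     if p not in ("", ".", "..")]
            if not parts:
                continue
            target = os.path.join(extractpath, *parts)
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

Calling extract_zip_sanitized("arch.zip", "out") on the archive from the session above would be expected to write file0.txt and file_1.txt under out. Note that two members whose names differ only in an illegal character end up at the same path, and the later one overwrites the earlier one.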
A quick browse through TarFile (source and docs) didn't reveal anything similar (and I wouldn't be very surprised if there isn't anything, as the .tar format is mainly used on *Nix), so you'd have to do it manually. Things aren't as simple as I expected, since TarFile doesn't offer the possibility of extracting a member under a different name, like ZipFile does.
Anyway, here's a piece of code (I had ZipFile and TarFile as muses or sources of inspiration):
code00.py:
#!/usr/bin/env python

import os
import shutil
import sys
import tarfile


def unpack_tar(filepath, extractpath=".", compression_flag="*"):
    # Characters that Windows does not allow in file names
    win_illegal = ':<>|"?*'
    table = str.maketrans(win_illegal, '_' * len(win_illegal))
    with tarfile.open(filepath, "r:" + compression_flag) as tar:
        for member in tar.getmembers():
            # Build the sanitized target path under extractpath (dirs included)
            target = os.path.join(extractpath, member.path.translate(table))
            if member.isdir():
                os.makedirs(target, exist_ok=True)
            elif member.isfile():
                # The archive may not list parent directories explicitly
                os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
                with tar.extractfile(member) as fin, open(target, "wb") as fout:
                    shutil.copyfileobj(fin, fout)


def main(*argv):
    unpack_tar("arch00.tar")


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)
Note that the above code works for simple .tar files (with simple members, including directories).
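A variation that may be worth trying instead of copying the data by hand: TarFile.extract accepts a TarInfo object, and the name attribute of a TarInfo can be reassigned before extraction, so the member is written under the new name while tarfile still takes care of the rest. The sketch below assumes that behavior; the helper name extract_tar_renamed is made up for this illustration, and I only checked the idea against simple archives:

```python
import os
import tarfile

# Same character set that ZipFile replaces on Windows.
WIN_ILLEGAL = ':<>|"?*'
_TABLE = str.maketrans(WIN_ILLEGAL, "_" * len(WIN_ILLEGAL))


def extract_tar_renamed(filepath, extractpath="."):
    """Extract a tar archive, renaming members with illegal characters.

    Hypothetical helper (not part of tarfile). Returns a dict mapping each
    original member name that was changed to its sanitized name.
    """
    renamed = {}
    with tarfile.open(filepath, "r:*") as tar:
        for member in tar.getmembers():
            original = member.name
            # Reassigning the name only changes where the member is written;
            # the data offset kept in the TarInfo object is left untouched.
            member.name = original.translate(_TABLE)
            if member.name != original:
                renamed[original] = member.name
            tar.extract(member, path=extractpath)
    return renamed
```

For example, extract_tar_renamed("arch00.tar", "out") would extract into out and report which names were changed. Link targets (linkname) are not rewritten here, so archives containing links to renamed members would need extra handling, and the usual caution about extracting untrusted archives still applies.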
I submitted [Python.Bugs]: tarfile: handling Windows (path) illegal characters in archive member names.
I don't know what its outcome is going to be, since I have submitted a couple of issues (and fixes for them) that were more serious (from my PoV), but for various reasons they were rejected.
Upvotes: 2