Jonaswg

Reputation: 404

Tarfile/Zipfile extractall() changing filename of some files

Hello, I am currently working on a tool that has to extract some .tar and .zip files.

It works great for the most part but I have one problem:

Some .tar and .zip files contain members whose names include "illegal" characters (e.g. ":"). This program has to run on Windows machines, so I have to deal with this.

Is there a way to change the name of a file in the extracted output if it contains a ":" or another character that is illegal on Windows?

My current implementation:

import tarfile
import zipfile


def read_zip(filepath, extractpath):
    with zipfile.ZipFile(filepath, 'r') as zfile:
        contains_bad_char = False
        for finfo in zfile.infolist():
            if ":" in finfo.filename:
                contains_bad_char = True
        if not contains_bad_char:
            zfile.extractall(path=extractpath)


def read_tar(filepath, extractpath):
    with tarfile.open(filepath, "r:gz") as tar:
        contains_bad_char = False
        for member in tar.getmembers():
            if ":" in member.name:
                contains_bad_char = True
        if not contains_bad_char:
            tar.extractall(path=extractpath)

So currently I am just skipping these archives altogether, which is not ideal.

To describe better what I am asking for I can provide a small example:

file_with_files.tar -> small_file_1.txt
                    -> small_file_2.txt
                    -> annoying:file_1.txt
                    -> annoying:file_2.txt

Should extract to

file_with_files -> small_file_1.txt
                -> small_file_2.txt
                -> annoying_file_1.txt
                -> annoying_file_2.txt

Is the only solution to iterate over every member of the archive and extract them one by one, or is there a more elegant solution?

Upvotes: 1

Views: 2537

Answers (1)

CristiFati

Reputation: 41116

According to [Python.Docs]: ZipFile.extract(member, path=None, pwd=None):

On Windows illegal characters (:, <, >, |, ", ?, and *) replaced by underscore (_).

So, things are already taken care of:

>>> import os
>>> import zipfile
>>>
>>> os.getcwd()
'e:\\Work\\Dev\\StackOverflow\\q055340013'
>>> os.listdir()
['arch.zip']
>>>
>>> zf = zipfile.ZipFile("arch.zip")
>>> zf.namelist()
['file0.txt', 'file:1.txt']
>>> zf.extractall()
>>> zf.close()
>>>
>>> os.listdir()
['arch.zip', 'file0.txt', 'file_1.txt']
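
The docs describe that replacement for Windows only, so if the same output names are wanted on every platform (for example when the archive is unpacked on Linux and the files are copied to Windows later), a manual loop along these lines might do. This is only a sketch: sanitize_name and extract_zip_sanitized are hypothetical names of mine, the character set is the one from the doc quote above, and other Windows restrictions (reserved names like CON, trailing dots) are not handled.

```python
import os
import zipfile

# Characters the quoted docs list as being replaced on Windows.
WIN_ILLEGAL = ':<>|"?*'
_TABLE = str.maketrans(WIN_ILLEGAL, "_" * len(WIN_ILLEGAL))


def sanitize_name(name):
    """Replace each Windows-illegal character with an underscore."""
    return name.translate(_TABLE)


def extract_zip_sanitized(filepath, extractpath="."):
    """Extract all members, renaming them on any platform (hypothetical helper)."""
    root = os.path.realpath(extractpath)
    with zipfile.ZipFile(filepath) as zf:
        for info in zf.infolist():
            target = os.path.realpath(
                os.path.join(root, sanitize_name(info.filename)))
            # Refuse members that would land outside the extraction root.
            if os.path.commonpath([root, target]) != root:
                raise ValueError("unsafe member name: " + info.filename)
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            # Create parent directories, then write the member's bytes.
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with zf.open(info) as src, open(target, "wb") as dst:
                dst.write(src.read())
```

Calling extract_zip_sanitized("arch.zip", "out") on the archive above should leave file0.txt and file_1.txt in out, regardless of the OS. Each member is read fully into memory, which is fine for small files but not for huge ones.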

A quick browse over TarFile (source and doc) didn't reveal anything similar, and I wouldn't be very surprised if there isn't anything, as the .tar format is mainly used on *nix. So you'd have to do it manually. Things aren't as simple as I expected, since TarFile doesn't offer the possibility of extracting a member under a different name, like ZipFile does.
Anyway, here's a piece of code (I used ZipFile and TarFile as sources of inspiration):

code00.py:

#!/usr/bin/env python

import sys
import os
import tarfile


def unpack_tar(filepath, extractpath=".", compression_flag="*"):
    # Characters that are illegal in Windows file names; each one maps to "_".
    win_illegal = ':<>|"?*'
    table = str.maketrans(win_illegal, '_' * len(win_illegal))
    with tarfile.open(filepath, "r:" + compression_flag) as tar:
        for member in tar.getmembers():
            # Build the target under extractpath for directories and files alike.
            target = os.path.join(extractpath, member.path.translate(table))
            if member.isdir():
                os.makedirs(target, exist_ok=True)
            elif member.isfile():
                # The archive may list a file before its parent directory.
                os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
                # extractfile() is the documented way to read a member's data.
                with tar.extractfile(member) as fin, open(target, "wb") as fout:
                    fout.write(fin.read())


def main(*argv):
    unpack_tar("arch00.tar")


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)

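Note that the above code only works for simple .tar files (regular files and directories); links and other special members are not handled, and file metadata (permissions, timestamps) is not restored.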
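
A possibly shorter variant: the TarInfo objects returned by getmembers() can be modified, and later operations on the archive use the modified values. So renaming the members in memory and then handing them to extractall() might be enough. This is a sketch, not something I tested on Windows: extract_tar_sanitized is a hypothetical name, and link members whose targets contain illegal characters are not accounted for. The filter argument is only passed when the running Python has extraction filters.

```python
import tarfile

# Same character set as in the ZipFile doc quote above.
WIN_ILLEGAL = ':<>|"?*'
_TABLE = str.maketrans(WIN_ILLEGAL, "_" * len(WIN_ILLEGAL))


def extract_tar_sanitized(filepath, extractpath="."):
    """Rename members in memory, then let extractall() do the work."""
    with tarfile.open(filepath, "r:*") as tar:
        members = tar.getmembers()
        for member in members:
            # Only the output name changes; the data offsets stay the same.
            member.name = member.name.translate(_TABLE)
        # Use the "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(path=extractpath, members=members, **kwargs)
```

With this, extractall() still takes care of directories, permissions and timestamps, which the manual loop above does not.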

Submitted [Python.Bugs]: tarfile: handling Windows (path) illegal characters in archive member names.
I don't know what its outcome is going to be, since I previously submitted a couple of issues (and also fixes for them) that were more serious (from my PoV), but they were rejected for various reasons.

Upvotes: 2
