Robert

Reputation: 95

tarfile.extractfile in Python 3.8 returns the name of the archive instead of the names of the archived files

I'm trying to get the MIME type of some archived files, then read and parse them, using the following code:

archive_file.tar.gz ---> file.csv, file.json, file.xlsx etc.

import tarfile


def parse_tar_gzip(element):
    from my_lib import parse_file
    from my_lib import NestedArchives

    try:
        tar = tarfile.open(fileobj=element, mode="r")
    except tarfile.ReadError:
        raise NestedArchives(element)
    else:
        for mem in tar.getmembers():
            if mem.isfile():
                # Keep only the base name and skip hidden files
                my_mems = mem.name.split("/")[-1]
                if not my_mems.startswith("."):
                    my_file = tar.extractfile(mem)
                    # my_mime = mimetypes.guess_type(my_file)
                    print(my_file)

                    # yield "", parse_file(my_file)


with open('/Users/my_name/Downloads/archive_file.tar.gz', 'rb') as my_files:
    blabla = parse_tar_gzip(my_files)
    print(blabla)

The problem is that my_file is returned as an ExFileObject whose name is archive_file.tar.gz instead of the name of a file inside the archive (e.g. file.json or file.xlsx), as shown below:

<ExFileObject name='/Users/my_name/Downloads/archive_file.tar.gz'>
<ExFileObject name='/Users/my_name/Downloads/archive_file.tar.gz'>
<ExFileObject name='/Users/my_name/Downloads/archive_file.tar.gz'>
<ExFileObject name='/Users/my_name/Downloads/archive_file.tar.gz'>

Shouldn't extractfile return the names of the files inside the archive? This is very strange, because when I was using Python 2.x I got the file names.

Upvotes: 2

Views: 1004

Answers (1)

ShadowRanger

Reputation: 155333

The ExFileObject is constructed from the underlying file handle to the tarball, without knowing the member being extracted (it's just told the offset, size and sparseness of the member being extracted). So it doesn't know the name of the thing being extracted; it only has the name of the original tarball, as shown.

Given that .name is supposed to tell you the file system name of the open file object, it's arguably correct, if somewhat misleading, to do this; you don't have a handle to an actual file system object based on the member name, just a handle to the tarball itself. You have access to the member's name at the moment you call extractfile, so just hold on to that information if you need it. The point of extractfile is to get the data, not the name it was stored under, after all.
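As a minimal sketch of "holding on to the name", the generator below takes the name from the TarInfo member rather than from the object returned by extractfile, and passes that name (not the file object) to mimetypes.guess_type. The function names iter_archive_members and build_sample_archive are hypothetical, and the in-memory sample archive only stands in for the asker's archive_file.tar.gz.

```python
import io
import mimetypes
import tarfile


def iter_archive_members(fileobj):
    """Yield (base name, guessed MIME type, bytes) for each regular, non-hidden member."""
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            # The member name comes from the TarInfo, not from the extracted file object.
            base_name = member.name.split("/")[-1]
            if base_name.startswith("."):
                continue
            extracted = tar.extractfile(member)
            # guess_type expects a name or URL string and returns (type, encoding).
            mime_type, _encoding = mimetypes.guess_type(base_name)
            yield base_name, mime_type, extracted.read()


def build_sample_archive():
    """Build a small gzip-compressed tarball in memory (illustrative data only)."""
    buffer = io.BytesIO()
    samples = (
        ("data/file.csv", b"a,b\n1,2\n"),
        ("data/file.json", b'{"a": 1}'),
        ("data/.hidden", b"x"),
    )
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


results = list(iter_archive_members(build_sample_archive()))
for name, mime_type, data in results:
    print(name, mime_type, len(data))
```

The guessed MIME type depends on the platform's mimetypes tables, so it may be None for extensions the system does not know about.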

Upvotes: 2
