user3684314

Reputation: 771

Use gzip with archive with multiple files in Python 3

I have a nested archive laid out like this:

main_archive.tar.gz
  main_archive.tar
    sub_archive.xml.gz
      actual_file.xml

There are hundreds of files in this archive. Can the gzip package be used with multiple files in Python 3? I've only used it with a single compressed file, so I'm at a loss as to how to iterate over multiple files or multiple levels of compression.

My usual method of decompressing is:

import gzip

with gzip.open(file_path, "rb") as f:
    for ln in f:
        text = ln.decode("utf-8")  # or whatever encoding the file uses

Of course, this doesn't work as-is here, because normally "f" is just a single file, and now I'm not sure what it represents.

Any help/advice would be much appreciated!

EDIT 1:

I've accepted the answer below, but if you're looking for similar code, my backbone was basically:

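import gzip
import tarfile
import xml.etree.ElementTree as ET

# mode="r" lets tarfile detect the gzip layer of the .tar.gz by itself
with tarfile.open(file_path, mode="r") as tar:
    for member in tar.getmembers():
        f = tar.extractfile(member)
        if f is None:  # directories and other non-file members
            continue
        if verbose:
            print("Decoding", member.name, "...")
        with gzip.open(f, "rb") as temp:
            # parse() takes a file object or a path, not the decoded text
            e = ET.parse(temp).getroot()
            for child in e:
                print(child.tag)
                print(child.attrib)
                print("\n\n")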

Main packages used were gzip, tarfile, and xml.etree.ElementTree.

Upvotes: 1

Views: 10071

Answers (2)

tripleee

Reputation: 189377

gzip only supports compressing a single file or stream. In your case, the decompressed stream is a tar archive, so you'd use Python's tarfile library to work with its contents. That library already knows how to handle .tar.gz, so you don't need to decompress the gzip layer yourself.
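
As a minimal sketch of that point, assuming the layout from the question: `tarfile.open` with the default mode `"r"` detects the gzip compression on its own, so the member names can be listed without calling `gzip` at all. The helper name `list_members` is made up for illustration.

```python
import tarfile


def list_members(archive_path):
    """Return the names of the regular files inside a .tar or .tar.gz archive."""
    # mode "r" means "read with transparent compression detection"
    with tarfile.open(archive_path, mode="r") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]
```

Calling `list_members("main_archive.tar.gz")` on the archive from the question should then return the names of the inner `.xml.gz` files, still compressed at that point.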

Upvotes: 4

Mark Adler

Reputation: 112339

Use Python's tarfile to get the contained files, and then Python's gzip again inside the loop to decompress each XML file.
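
A rough sketch of that two-level loop, assuming every inner file of interest ends in `.gz` and contains well-formed XML. The generator name `iter_xml_roots` is hypothetical, and the `.gz` suffix check is an assumption based on the layout in the question.

```python
import gzip
import tarfile
import xml.etree.ElementTree as ET


def iter_xml_roots(archive_path):
    """Yield (member_name, root_element) for each gzipped XML file in the tar."""
    # Outer level: tarfile handles the .tar.gz wrapper transparently.
    with tarfile.open(archive_path, mode="r") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(".gz"):
                continue  # skip directories and anything that is not gzipped
            raw = tar.extractfile(member)
            # Inner level: gzip.open accepts an open binary file object
            # as well as a path, so nothing is written to disk.
            with gzip.open(raw, "rb") as inner:
                yield member.name, ET.parse(inner).getroot()
```

Each root must be consumed while the generator is still being iterated, for example `for name, root in iter_xml_roots(path): print(name, root.tag)`, because the archive is closed once the loop finishes.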

Upvotes: 0
