Alex

Reputation: 44405

How to read and list a tgz file in python3?

In Python 3 (3.6.8) I want to read a gzipped tar file and list its contents.

I found this solution, which yields an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Searching for this error, I found this suggestion, so I tried the following code snippet:

with open(out_file) as fd:
    gzip_fd = gzip.GzipFile(fileobj=fd)
    tar = tarfile.open(gzip_fd.read())

which yields the same error!

So how to do it right?

After looking at the actual documentation here, I came up with the following code:

tar = tarfile.open(out_file, "w:gz")
for member in tar.getnames():
   print(tar.extractfile(member).read())

which finally worked without errors - but did not print any content of the tar archive on the screen!

The tar file is well formatted and contains folders and files. (I need to try to share this file)

Upvotes: 1

Views: 4166

Answers (3)

Alex

Reputation: 44405

Not sure why it did not work before, but the following lists the files and folders of a gzipped tar archive for me with Python 3.6:

import tarfile

with tarfile.open(filename, "r:gz") as tar:  # "r:gz" opens the archive read-only
    print(tar.getnames())
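
A likely reason the earlier attempt printed nothing: mode "w:gz" opens the archive for writing, which creates a new, empty archive over the existing file. Also, extractfile() returns None for folders, so calling .read() on every member would fail as soon as a folder comes up. Below is a minimal sketch of listing members and reading file contents in read mode. The helper name list_tgz and the sample archive built here are only assumptions for illustration, not something from the question:

```python
import io
import os
import tarfile
import tempfile


def list_tgz(path):
    """Return (name, content) pairs; content is None for non-regular members."""
    entries = []
    with tarfile.open(path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                # extractfile() gives a binary file-like object for regular files
                entries.append((member.name, tar.extractfile(member).read()))
            else:
                # folders (and other special members) have no readable content
                entries.append((member.name, None))
    return entries


# Build a small sample archive so the sketch is self-contained (hypothetical data).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.tgz")
with tarfile.open(sample_path, "w:gz") as tar:
    folder = tarfile.TarInfo("docs")
    folder.type = tarfile.DIRTYPE
    tar.addfile(folder)
    data = b"hello"
    info = tarfile.TarInfo("docs/hello.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

for name, content in list_tgz(sample_path):
    print(name, content)
```

Using the archive as a context manager also makes sure it is closed again after reading.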

Upvotes: 1

Chocorean

Reputation: 888

The python-archive module (available on pip) could help you:

from archive import extract

file = "you/file.tgz"
try:
    extract(file, "out/%s.raw" % (file), ext=".tgz")
except Exception:
    # could not extract
    pass

Available extensions are (v0.2): '.zip', '.egg', '.jar', '.tar', '.tar.gz', '.tgz', '.tar.bz2', '.tz2'

More info: https://pypi.org/project/python-archive/

Upvotes: 0

abdusco

Reputation: 11151

When you open a file without specifying a mode, it defaults to reading it as text. You need to open the file as a raw byte stream using mode='rb', then feed it to the gzip reader:

import gzip
import tarfile

with open(out_file, mode='rb') as fd:
    gzip_fd = gzip.GzipFile(fileobj=fd)
    tar = tarfile.open(fileobj=gzip_fd)  # pass the stream via fileobj, not as a file name
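
For context, the 0x8b in the error message is the second byte of the gzip magic number (0x1f 0x8b), which is what the UTF-8 decoder trips over when the file is opened in text mode. As an alternative sketch, tarfile can do the gzip decompression itself if it is handed the binary file object through fileobj. The helper name names_from_binary_file and the sample archive are assumptions for illustration only:

```python
import io
import os
import tarfile
import tempfile


def names_from_binary_file(path):
    """Open the file in binary mode and let tarfile decompress the gzip stream."""
    with open(path, mode="rb") as fd:
        with tarfile.open(fileobj=fd, mode="r:gz") as tar:
            return tar.getnames()


# Build a tiny sample archive so the sketch is self-contained (hypothetical data).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.tgz")
data = b"content"
with tarfile.open(sample_path, "w:gz") as tar:
    info = tarfile.TarInfo("a.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Peek at the first two bytes: a gzip file starts with 0x1f 0x8b.
with open(sample_path, "rb") as fd:
    magic = fd.read(2)

print(magic == b"\x1f\x8b")
print(names_from_binary_file(sample_path))
```

If the archive is already on disk, passing the path directly to tarfile.open(path, "r:gz") is simpler; the fileobj form is mostly useful when the data comes from an already open stream.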

Upvotes: 0
