yigal

Reputation: 4725

Read *.tar.gz file in python without extracting

In Python 3, I am trying to read files that reside in a tar.gz archive without extracting them (meaning without writing the extracted files to disk). I found the tarfile module, and this is what I wrote (much simplified):

import tarfile

tar = tarfile.open('arhivename.tar.gz', encoding='utf-8')
for x in tar.getmembers():
    filelikeobject = tar.extractfile(x)
    # pass the filelikeobject to a third-party function that accepts a file-like object that reads strings

    # the following lines are for debugging:
    r = filelikeobject.read()
    print(type(r).__name__)  # prints 'bytes', but I need 'str'

The problem is that tar.extractfile(x) returns a file object whose read() returns bytes. I need it to return str, decoded as UTF-8.

Upvotes: 6

Views: 10413

Answers (1)

Dietrich Epp

Reputation: 213368

When you call tarfile.open,

tarfile.open('arhivename.tar.gz', encoding='utf-8')

the encoding parameter controls the encoding of the filenames, not the encoding of the file contents. It would not make sense for it to control the contents, because different files inside the same archive can be encoded differently. As far as tarfile is concerned, the members are just binary data, which is why read() returns bytes.
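To see this concretely, here is a small sketch that builds an archive in memory (so nothing is written to disk) with two members holding the same text in different encodings. The helper build_archive and the member names are made up for illustration; only the tarfile and io calls are standard library.

```python
import io
import tarfile


def build_archive(files):
    """Return the bytes of a gzip-compressed tar archive holding `files` (name -> bytes)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Two members holding the same text in different encodings.
archive = build_archive({
    "utf8.txt": "café".encode("utf-8"),
    "latin1.txt": "café".encode("latin-1"),
})

raw = {}
with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz", encoding="utf-8") as tar:
    for member in tar.getmembers():
        # read() yields bytes even though encoding="utf-8" was passed to open()
        raw[member.name] = tar.extractfile(member).read()

print(type(raw["utf8.txt"]).__name__)
# Each member has to be decoded with its own encoding to recover the text.
print(raw["utf8.txt"].decode("utf-8") == raw["latin1.txt"].decode("latin-1"))
```

The raw bytes of the two members differ, and no single archive-wide setting could decode both correctly.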

You can decode this data by wrapping the file with the UTF-8 stream reader from the codecs module:

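import codecs

utf8reader = codecs.getreader('utf-8')
for member in tar.getmembers():
    if not member.isfile():
        continue  # extractfile() returns None for directories and other non-file members
    fp = utf8reader(tar.extractfile(member))
    # fp.read() now returns str decoded as UTF-8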
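For a version you can run end to end, the sketch below creates a tiny .tar.gz in memory and reads it back as text without touching the disk. It uses the codecs.getreader approach above and, as an alternative that is not part of the original answer, io.TextIOWrapper, which should also work because extractfile() returns a buffered binary reader in Python 3. The member name greeting.txt and its contents are invented for the example.

```python
import codecs
import io
import tarfile

# Build a small .tar.gz in memory so the sketch needs no file on disk.
payload = "héllo wörld".encode("utf-8")
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="greeting.txt")  # hypothetical member name
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

utf8reader = codecs.getreader("utf-8")
texts = {}
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    for member in tar.getmembers():
        if not member.isfile():
            continue  # extractfile() returns None for non-file members
        # Option 1: wrap the binary file object in the codecs stream reader.
        with utf8reader(tar.extractfile(member)) as fp:
            via_codecs = fp.read()
        # Option 2 (alternative): wrap it in io.TextIOWrapper instead.
        with io.TextIOWrapper(tar.extractfile(member), encoding="utf-8") as fp:
            via_wrapper = fp.read()
        texts[member.name] = (via_codecs, via_wrapper)

print(type(texts["greeting.txt"][0]).__name__)
```

Either wrapper can be handed to a function that expects a text-mode file-like object. Which one suits the third-party function better depends on which file methods it calls, so treat the choice as something to verify against that function.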

Upvotes: 7
