Reputation: 4725
In Python 3, I am trying to read files that reside in a tar.gz archive without extracting them (that is, without writing the extracted files to disk). I found the tarfile module, and this is what I wrote (much simplified):
import tarfile

tar = tarfile.open('arhivename.tar.gz', encoding='utf-8')
for x in tar.getmembers():
    filelikeobject = tar.extractfile(x)
    # pass filelikeobject to a third-party function that accepts a file-like object that reads strings
    # the following lines are for debugging:
    r = filelikeobject.read()
    print(type(r).__name__)  # prints 'bytes', but I need 'str'
The problem is that tar.extractfile(x) returns a file object whose read() returns bytes. I need it to return str, decoded as UTF-8.
Upvotes: 6
Views: 10413
Reputation: 213368
When you call tarfile.open,
tarfile.open('arhivename.tar.gz', encoding='utf-8')
the encoding parameter controls the encoding of the filenames, not the encoding of the file contents. It doesn't make sense for the encoding parameter to control the encoding of the file contents, because different files inside the tar file can be encoded differently. So, a tar file really just contains binary data.
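To make that concrete, here is a small sketch that builds an archive in memory and reads it back. The helper build_archive, the member names and the sample text are all made up for illustration; only the tarfile and io calls are real library API.

```python
import io
import tarfile

def build_archive(files):
    """Return the bytes of a tar.gz archive holding the given {name: bytes} files.

    Hypothetical helper, used only to get a test archive without touching disk.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w:gz') as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# Two members holding the same text in different encodings (made-up sample data).
archive = build_archive({
    'utf8.txt': 'café'.encode('utf-8'),
    'latin1.txt': 'café'.encode('latin-1'),
})

contents = {}
with tarfile.open(fileobj=io.BytesIO(archive), mode='r:gz', encoding='utf-8') as tar:
    for member in tar.getmembers():
        # read() returns bytes no matter what encoding= was passed to open()
        contents[member.name] = tar.extractfile(member).read()

print(contents['utf8.txt'])    # b'caf\xc3\xa9'
print(contents['latin1.txt'])  # b'caf\xe9'
```

Both members come back as bytes, and each one needs its own codec to decode correctly, which is why a single archive-wide content encoding would not work.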
You can decode this data by wrapping the file with the UTF-8 stream reader from the codecs module:
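import codecs

utf8reader = codecs.getreader('utf-8')
for member in tar.getmembers():
    raw = tar.extractfile(member)  # None for non-file members such as directories
    if raw is not None:
        fp = utf8reader(raw)  # fp.read() now returns str instead of bytes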
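For a version you can run end to end without an archive on disk, here is a minimal sketch. The functions make_sample_archive and read_text_members, the member name sample.txt and its contents are assumptions for illustration, not part of the question's code.

```python
import codecs
import io
import tarfile

def make_sample_archive():
    """Build a small tar.gz in memory (hypothetical sample data)."""
    buf = io.BytesIO()
    data = 'héllo\nwörld\n'.encode('utf-8')
    with tarfile.open(fileobj=buf, mode='w:gz') as tar:
        info = tarfile.TarInfo(name='sample.txt')
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf

def read_text_members(fileobj, encoding='utf-8'):
    """Return {member name: decoded text} for every regular file in the archive."""
    make_reader = codecs.getreader(encoding)
    texts = {}
    with tarfile.open(fileobj=fileobj, mode='r:gz') as tar:
        for member in tar.getmembers():
            raw = tar.extractfile(member)
            if raw is None:  # directories and other non-file members
                continue
            # The stream reader decodes the bytes as they are read
            texts[member.name] = make_reader(raw).read()
    return texts

texts = read_text_members(make_sample_archive())
print(type(texts['sample.txt']).__name__)  # str
```

If the third-party function expects a regular text-mode file object, io.TextIOWrapper(raw, encoding='utf-8') should work as an alternative wrapper, since the object returned by extractfile is a buffered binary reader.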
Upvotes: 7