Reputation: 300
I've got a large tar file in S3 (tens of GB). It contains a number of tar.gz files.
I can loop through the contents of the large file with something like:
    import boto3
    import tarfile
    s3_client = boto3.client('s3')
    input = s3_client.get_object(Bucket=bucket, Key=key)
    # 'r|' reads the body as a forward-only stream
    with tarfile.open(fileobj=input['Body'], mode='r|') as tar:
        print(tar)  # tarinfo
However, I can't seem to open the contents of the inner tar.gz files.
I want to do this in a streaming manner rather than loading the whole file into memory.
I've tried things like:
    tar.extract_file(tar.next)
But I'm not sure how this file-like object is then readable.
--- EDIT
I've got slightly further with the help of @larsks.
    with tarfile.open(fileobj=input_tar_file['Body'], mode='r|') as tar:
        for item in tar:
            m = tar.extractfile(item)
            if m is not None:
                with tarfile.open(fileobj=m, mode='r|gz') as gz:
                    for data in gz:
                        d = gz.extractfile(data)
However, if I call .read() on d, it is empty. If I go through d.raw.fileobj.read() there is data, but when I write that out it's the data from all the text files in the nested tar.gz rather than one file at a time.
Upvotes: 0
Views: 1611
Reputation: 311675
The return value of tar.extractfile is a "file-like object", just like input['Body']. That means you can simply pass that to tarfile.open. Here's a simple example that prints the contents of a nested archive:
    import tarfile
    with open('outside.tar', 'rb') as fd:
        with tarfile.open(fileobj=fd, mode='r') as outside:
            for item in outside:
                with outside.extractfile(item) as inside:
                    with tarfile.open(fileobj=inside, mode='r') as inside_tar:
                        for item in inside_tar:
                            data = inside_tar.extractfile(item)
                            print('content:', data.read())
Here the "outside" file is an actual file, rather than something coming from an S3 bucket; but I'm opening it first so that we're still passing in fileobj when opening the outside archive.
The code iterates through the contents of the outside archive (for item in outside), and for each of these items it opens the item with outside.extractfile() and passes the resulting file object as the fileobj parameter of tarfile.open.
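For a body that can only be read forward, such as input['Body'], the same idea can probably be expressed with tarfile's stream modes ('r|' for the outer tar, 'r|gz' for each nested archive). The following is a minimal sketch rather than a tested S3 solution: iter_nested_files, ForwardOnlyReader and build_demo_archive are hypothetical names used only for illustration, the demo data is built in memory instead of coming from S3, and it assumes that every regular file in the outer tar is a tar.gz and that each inner file is small enough to read into memory on its own.

```python
import io
import tarfile


def iter_nested_files(body):
    """Yield (archive_name, member_name, data) for each regular file inside
    each tar.gz stored in the outer tar that is read from `body`."""
    # 'r|' reads the outer tar strictly forward, so `body` only needs .read().
    with tarfile.open(fileobj=body, mode='r|') as outer:
        for archive in outer:
            if not archive.isfile():  # skip directories, links, etc.
                continue
            inner_stream = outer.extractfile(archive)
            # 'r|gz' decompresses the nested archive as a forward-only stream.
            with tarfile.open(fileobj=inner_stream, mode='r|gz') as inner:
                for member in inner:
                    if not member.isfile():
                        continue
                    # Read each member before moving on; a stream cannot rewind.
                    yield archive.name, member.name, inner.extractfile(member).read()


class ForwardOnlyReader:
    """Hypothetical stand-in for a non-seekable body: exposes read() only."""

    def __init__(self, raw):
        self._raw = raw

    def read(self, size=-1):
        return self._raw.read(size)


def build_demo_archive():
    """Build a small outer tar holding two tar.gz archives (demo data only)."""
    outer_buf = io.BytesIO()
    with tarfile.open(fileobj=outer_buf, mode='w') as outer:
        for n in (1, 2):
            inner_buf = io.BytesIO()
            with tarfile.open(fileobj=inner_buf, mode='w:gz') as inner:
                for suffix in ('a', 'b'):
                    payload = f'text {n}{suffix}'.encode()
                    info = tarfile.TarInfo(name=f'file{n}{suffix}.txt')
                    info.size = len(payload)
                    inner.addfile(info, io.BytesIO(payload))
            compressed = inner_buf.getvalue()
            info = tarfile.TarInfo(name=f'inner{n}.tar.gz')
            info.size = len(compressed)
            outer.addfile(info, io.BytesIO(compressed))
    outer_buf.seek(0)
    return outer_buf


# Usage: each inner file arrives separately, in archive order.
for archive_name, member_name, data in iter_nested_files(
        ForwardOnlyReader(build_demo_archive())):
    print(archive_name, member_name, data)
```

With a real bucket, the ForwardOnlyReader(...) argument would presumably be replaced by input['Body']. Because both archives are read as streams, each member has to be consumed before the loop advances to the next one.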
Upvotes: 1