Nowandthen98

Reputation: 300

Unpack nested tar files in s3 in streaming fashion

I've got a large tar file in S3 (tens of GBs). It contains a number of tar.gz files.

I can loop through the contents of the large file with something like:


    import boto3
    import tarfile

    s3_client = boto3.client('s3')
    input = s3_client.get_object(Bucket=bucket, Key=key)

    with tarfile.open(fileobj=input['Body'], mode='r|') as tar:
        for member in tar:
            print(member)  # each item is a TarInfo

However I can't seem to open the file contents from the inner tar.gz file.

I want to be able to do this in a streaming manner rather than load the whole file into memory.

I've tried doing things like

    tar.extractfile(tar.next())

But I'm not sure how to read from the resulting file-like object.

--- EDIT

I've got slightly further with the help of @larsks.


    with tarfile.open(fileobj=input_tar_file['Body'], mode='r|') as tar:
        for item in tar:
            m = tar.extractfile(item)
            if m is not None:
                with tarfile.open(fileobj=m, mode='r|gz') as gz:
                    for data in gz:
                        d = gz.extractfile(data)

However, if I call .read() on d, it is empty. If I go through d.raw.fileobj.read(), there is data, but when I write that out it is the data from all the text files in the nested tar.gz rather than one file at a time.

Upvotes: 0

Views: 1611

Answers (1)

larsks

Reputation: 311675

The return value of tar.extractfile is a "file-like object", just like input['Body']. That means you can simply pass that to tarfile.open. Here's a simple example that prints the contents of a nested archive:

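    import tarfile

    with open('outside.tar', 'rb') as fd:
        with tarfile.open(fileobj=fd, mode='r') as outside:
            for item in outside:
                inside = outside.extractfile(item)
                if inside is None:  # skip directories and other non-file members
                    continue
                # mode='r' detects compression, so tar and tar.gz both work here
                with tarfile.open(fileobj=inside, mode='r') as inside_tar:
                    for inner_item in inside_tar:
                        data = inside_tar.extractfile(inner_item)
                        if data is not None:
                            print('content:', data.read())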

Here the "outside" file is an actual file, rather than something coming from an S3 bucket; but I'm opening it first so that we're still passing in fileobj when opening the outside archive.

The code iterates through the contents of the outside archive (for item in outside), and for each of these items it:

  • Opens the member using outside.extractfile()
  • Passes that as the argument to the fileobj parameter of tarfile.open
  • Extracts each item inside the nested tarfile
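
For the streaming case in the question, the same steps can be sketched with tarfile's stream modes ('r|' for the outer archive, 'r|gz' for each inner one). This is only a sketch and has not been run against S3: ReadOnlyStream is a hypothetical stand-in for a non-seekable body such as input['Body'], and make_tar, iter_nested and the sample file names are made up for illustration. It also assumes every regular member of the outer archive is a tar.gz.

```python
import io
import tarfile


class ReadOnlyStream:
    """Hypothetical stand-in for a non-seekable body (only read() is offered)."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


def make_tar(files, mode):
    """Build an in-memory archive from a {name: bytes} mapping (demo helper)."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode=mode) as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return out.getvalue()


def iter_nested(stream):
    """Yield (inner archive name, member name, content) one member at a time."""
    with tarfile.open(fileobj=stream, mode='r|') as outside:
        for outer_item in outside:
            inside = outside.extractfile(outer_item)
            if inside is None:  # directories, links and other non-file members
                continue
            with tarfile.open(fileobj=inside, mode='r|gz') as inside_tar:
                for inner_item in inside_tar:
                    data = inside_tar.extractfile(inner_item)
                    if data is not None:
                        # Read now, before the iterator moves to the next member.
                        yield outer_item.name, inner_item.name, data.read()


# Sample data: an outer tar holding two small tar.gz archives.
inner_a = make_tar({'a1.txt': b'alpha', 'a2.txt': b'beta'}, 'w:gz')
inner_b = make_tar({'b1.txt': b'gamma'}, 'w:gz')
outer = make_tar({'a.tar.gz': inner_a, 'b.tar.gz': inner_b}, 'w')

results = list(iter_nested(ReadOnlyStream(outer)))
for row in results:
    print(row)
```

In stream mode each member has to be read before moving on to the next one, because the underlying stream cannot seek backwards. For very large inner files, data.read(chunk_size) in a loop would avoid holding a whole member in memory.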

Upvotes: 1
