Reputation: 9037
While looking around for ideas I found https://stackoverflow.com/a/54222447/264822 for zip files which I think is a very clever solution. But it relies on zip files having a Central Directory - tar files don't.
I thought I could follow the same general principle and expose the S3 file to tarfile through the fileobj parameter:
import boto3
import io
import tarfile

class S3File(io.BytesIO):
    def __init__(self, bucket_name, key_name, s3client):
        super().__init__()
        self.bucket_name = bucket_name
        self.key_name = key_name
        self.s3client = s3client
        self.offset = 0

    def close(self):
        return

    def read(self, size):
        print('read: offset = {}, size = {}'.format(self.offset, size))
        start = self.offset
        end = self.offset + size - 1
        try:
            # fetch only the requested byte range from S3
            s3_object = self.s3client.get_object(Bucket=self.bucket_name, Key=self.key_name, Range="bytes=%d-%d" % (start, end))
        except:
            # a failed range request (e.g. past end of object) behaves like EOF
            return bytearray()
        self.offset = self.offset + size
        result = s3_object['Body'].read()
        return result

    def seek(self, offset, whence=0):
        if whence == 0:
            print('seek: offset {} -> {}'.format(self.offset, offset))
            self.offset = offset

    def tell(self):
        return self.offset

s3file = S3File(bucket_name, file_name, s3client)

tarf = tarfile.open(fileobj=s3file)
names = tarf.getnames()
for name in names:
    print(name)
This works fine except the output looks like:
read: offset = 0, size = 2
read: offset = 2, size = 8
read: offset = 10, size = 8192
read: offset = 8202, size = 1235
read: offset = 9437, size = 1563
read: offset = 11000, size = 3286
read: offset = 14286, size = 519
read: offset = 14805, size = 625
read: offset = 15430, size = 1128
read: offset = 16558, size = 519
read: offset = 17077, size = 573
read: offset = 17650, size = 620
(continued)
tarfile is just reading the whole file anyway so I haven't gained anything. Is there any way of making tarfile only read the parts of the file it needs? The only alternative I can think of is re-implementing the tar file parsing myself so that it reads each header into a BytesIO buffer and skips over the file data to get to the next header. But this seems overly complicated.
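For what it's worth, a rough sketch of that manual approach might look like the following. This is only an illustration of the idea: list_tar_names is a name I made up, and it only handles plain ustar-style headers (no GNU long names or pax extensions).
import boto3

BLOCK = 512  # tar works in 512-byte header and data blocks

def list_tar_names(s3client, bucket, key):
    # walk the headers with ranged GETs, skipping the file data in between
    offset = 0
    names = []
    while True:
        header = s3client.get_object(
            Bucket=bucket, Key=key,
            Range="bytes={}-{}".format(offset, offset + BLOCK - 1))['Body'].read()
        if len(header) < BLOCK or header == b"\0" * BLOCK:
            break  # end-of-archive marker (zero block) or truncated archive
        name = header[0:100].rstrip(b"\0").decode()
        size = int(header[124:136].rstrip(b"\0 ") or b"0", 8)  # size field is octal
        names.append(name)
        # file data is padded out to whole 512-byte blocks
        data_blocks = (size + BLOCK - 1) // BLOCK
        offset += BLOCK * (1 + data_blocks)
    return names
Each member then costs one small GET for its header, so the download stays proportional to the number of files rather than the size of the archive.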
Upvotes: 11
Views: 6625
Reputation: 10335
I just tested your original code on a tar file and it works quite well.
Here is my sample output (truncated). I made some minor changes to display the total downloaded bytes and the seek step size in kB (published at this gist). This is for a 1 GB tar file containing 321 files (average size per file is 3 MB):
read: offset = 0, size = 2, total download = 2
seek: offset 2 -> 0 (diff = -1 kB)
read: offset = 0, size = 8192, total download = 8194
seek: offset 8192 -> 0 (diff = -9 kB)
read: offset = 0, size = 8192, total download = 16386
seek: offset 8192 -> 0 (diff = -9 kB)
read: offset = 0, size = 512, total download = 16898
<TarInfo 'yt.txt' at 0x7fbbed639ef0>
seek: offset 512 -> 7167 (diff = 6 kB)
read: offset = 7167, size = 1, total download = 16899
read: offset = 7168, size = 512, total download = 17411
<TarInfo 'yt_cache/youtube-sigfuncs' at 0x7fbbed639e20>
read: offset = 7680, size = 512, total download = 17923
...
<TarInfo 'yt_vids/whistle_dolphins-SZTC_zT9ijg.m4a' at 0x7fbbecc697a0>
seek: offset 1004473856 -> 1005401599 (diff = 927 kB)
read: offset = 1005401599, size = 1, total download = 211778
read: offset = 1005401600, size = 512, total download = 212290
None
322
So this downloads 212 kB for a 1GB tar file in order to get a list of 321 filenames in about 2 minutes on colab and 1.5 minutes on ec2 in the same region as the bucket.
In comparison, it takes 17 seconds to download the full file on colab and 1 second to list the files in it with tar -tf file.tar. So if I'm optimizing for execution time, I'd rather just download the full file and work on it locally. Otherwise, there might be some optimization that could be done in your original code? IDK.
OTOH, fetching a single file is more efficient than the above 2 minutes if it's at the beginning of the tar, but as slow as getting all the file names if it's at the end. I couldn't do that with the getmember() function, though, because it seems to internally call getmembers(), which has to go through the full file. Instead, I rolled my own while loop to find the file and abort the search once it's found:
bucket_name, file_name = "bucket", "file.tar"

import boto3
s3client = boto3.client("s3")

s3file = S3File(bucket_name, file_name, s3client)

import tarfile
with tarfile.open(mode="r", fileobj=s3file) as tarf:
    tarinfo = 1  # dummy value to enter the loop
    while tarinfo is not None:
        tarinfo = tarf.next()
        # stop as soon as the member we are looking for turns up
        if tarinfo is not None and tarinfo.name == name_search:
            break
I think a future direction for this would be to have tarfile.open(...) cache the offsets of each file so that a subsequent open doesn't go through the full file again. Once that's done, a first pass through the tar file would allow downloading individual files from the tar in S3 without going through the full file again and again for each file.
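To make that concrete, here is a rough sketch of the caching idea. The helper names are mine, and it leans on the TarInfo.offset_data and TarInfo.size attributes, which aren't in the official docs but are what tarfile itself uses to locate member data: one slow pass records where every member's data starts, after which any single member can be pulled with one ranged GET.
def build_offset_index(tarf):
    # one (slow) pass over the whole archive; afterwards every member's
    # data position and length inside the tar are known
    return {m.name: (m.offset_data, m.size) for m in tarf.getmembers()}

def fetch_member(s3client, bucket, key, index, member_name):
    # works for a plain (uncompressed) tar object stored in S3
    start, size = index[member_name]
    if size == 0:
        return b""
    resp = s3client.get_object(Bucket=bucket, Key=key,
                               Range="bytes={}-{}".format(start, start + size - 1))
    return resp['Body'].read()
The index itself could be dumped to JSON next to the tar in S3, so later runs skip the first pass entirely.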
Side note, couldn't you have just run gunzip on the tar.gz to get the tar to test on?
Upvotes: 1
Reputation: 9037
My mistake. I'm actually dealing with tar.gz files, but I assumed that zip and tar.gz are similar. They're not - a tar is an archive file which is then compressed with gzip, so to read the tar you have to decompress it first. My idea of pulling bits out of the tar file won't work.
What does work is:
s3_object = s3client.get_object(Bucket=bucket_name, Key=file_name)
wholefile = s3_object['Body'].read()
fileobj = io.BytesIO(wholefile)
tarf = tarfile.open(fileobj=fileobj)
names = tarf.getnames()
for name in names:
    print(name)
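If keeping the whole archive in memory becomes a problem, a possible variation (just a sketch, assuming tarfile's streaming mode is acceptable here) is to hand the S3 response body straight to tarfile.open with mode="r|gz". The streaming mode only ever calls read() on the file object, which the boto3 body provides; it still transfers the whole object, since gzip has to be decompressed sequentially, but it doesn't buffer it all at once:
s3_object = s3client.get_object(Bucket=bucket_name, Key=file_name)
# "r|gz" is tarfile's streaming mode: members are yielded while the
# gzip stream is read and decompressed sequentially
with tarfile.open(fileobj=s3_object['Body'], mode="r|gz") as tarf:
    for member in tarf:
        print(member.name)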
I suspect the original code will work for a tar file but I don't have any to try it on.
Upvotes: 7