Reputation: 6045
I have a bucket with 4+ million files (50GB+). I'd like to get the list of files (without the data) using Python without downloading the files.
files = s3_bucket.objects.filter(Prefix='myPrefix')
# print(len(list(files)))
for key in files:
    print(key.last_modified)
I have something like this, but I notice a lot of data coming through the network.
I was looking at the documentation for ObjectSummary, hoping it only downloads the metadata. Under "ObjectSummary and HEAD operation" it says:
The HEAD operation retrieves metadata from an object without returning the object itself. This operation is useful if you're only interested in an object's metadata. To use HEAD, you must have READ access to the object.
A HEAD request has the same options as a GET operation on an object. The response is identical to the GET response except that there is no response body.
Is it still having to download the entire file just to retrieve the filenames?
Upvotes: 1
Views: 984
Reputation: 269091
When using the resource method in boto3, the requests actually get translated into other API calls. However, it's not easy to see which calls happen "behind the scenes". Sometimes one method can translate into multiple calls (eg ListObjects and HeadObject).
You might consider using the client method of calls, since they map 1:1 to the API calls on AWS:
import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')
response_iterator = paginator.paginate(Bucket='bucket-name')

for page in response_iterator:
    for obj in page['Contents']:
        print(obj['Key'], obj['LastModified'])
I would also recommend that you look at Amazon S3 Inventory. It can deliver a daily (or weekly) report listing all objects and their metadata, which is very useful for large buckets such as yours.
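Inventory reports arrive as gzipped CSV files with no header row; the column order follows the fields chosen in the inventory configuration, described in the report's manifest.json. As a rough sketch, assuming the Bucket, Key, Size and LastModifiedDate fields were selected, one report file could be parsed like this:

```python
import csv
import gzip
import io

def parse_inventory_csv(gzipped_bytes):
    """Yield (key, size, last_modified) rows from one gzipped
    S3 Inventory CSV report.

    Assumed column order: Bucket, Key, Size, LastModifiedDate
    (check your inventory's manifest.json for the real schema).
    """
    with gzip.open(io.BytesIO(gzipped_bytes), mode='rt', newline='') as f:
        for bucket, key, size, last_modified in csv.reader(f):
            yield key, int(size), last_modified
```

Since the report is a flat file, you can scan all 4+ million entries with zero API calls beyond the one download of each report object.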
Upvotes: 1