Reputation: 6524
When using the Amazon Ruby SDK for S3, I need to enumerate ALL the files in a huge bucket in order to identify any empty files, i.e. those where obj.content_length == 0.
I've written a script to do that like this:
bucket.objects.each do |obj|
  total_objs += 1
  if obj.content_length == 0
    empty_files += 1
    puts obj.key
  end
end
...but I'm concerned that this will result in each file being downloaded to determine the file size. Does the SDK actually download the file to know the size, or is it just metadata that gets pulled and then the object gets downloaded lazily if the appropriate method is called?
Also, is there a more efficient way to achieve what I'm trying to do?
Upvotes: 1
Views: 1506
Reputation: 6528
The easiest way to get what you want is to use the v2 AWS SDK for Ruby, available as aws-sdk-core:
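require 'aws-sdk-core'

empty_files = 0
s3 = Aws::S3::Client.new

# Each resp is one page of the listing (up to 1,000 objects).
s3.list_objects(bucket: 'aws-sdk').each do |resp|
  resp.contents.each do |obj|
    # Listing entries expose #size; #content_length is not available here.
    if obj.size == 0
      empty_files += 1
      puts obj.key
    end
  end
end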
The code above makes exactly one request per 1,000 objects, because S3 returns information about at most 1,000 objects per response. It uses the SDK's built-in response paging to keep calling #list_objects until the bucket is exhausted. Listing returns only metadata (key, size, and so on) and does not download the object bodies; to fetch a body, call Aws::S3::Client#get_object.
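To make the paging behaviour concrete, here is a rough sketch of the kind of loop the SDK runs for you. FakeClient, FakeObject and FakePage are hypothetical stand-ins invented for this illustration so that it runs without AWS credentials; they only mimic the parts of the list_objects response used here (contents, is_truncated, next_marker). Treat the marker handling as an assumption about how the listing continues, not as the SDK's actual implementation.

```ruby
# Hypothetical stand-ins for the response shape (illustration only).
FakeObject = Struct.new(:key, :size)
FakePage   = Struct.new(:contents, :is_truncated, :next_marker)

# Hypothetical in-memory client that returns at most page_size keys per
# call, mimicking how S3 caps each listing response.
class FakeClient
  attr_reader :requests

  def initialize(objects, page_size)
    @objects = objects.sort_by(&:key)
    @page_size = page_size
    @requests = 0
  end

  def list_objects(bucket:, marker: nil)
    @requests += 1
    remaining = marker ? @objects.select { |o| o.key > marker } : @objects
    page = remaining.first(@page_size)
    FakePage.new(page, remaining.size > page.size, nil)
  end
end

# Manual paging: keep requesting pages until the listing is exhausted,
# collecting the keys of zero-byte objects along the way.
def empty_keys(client, bucket)
  keys = []
  marker = nil
  loop do
    params = { bucket: bucket }
    params[:marker] = marker if marker
    resp = client.list_objects(**params)
    resp.contents.each { |obj| keys << obj.key if obj.size.zero? }
    break unless resp.is_truncated
    # Continue after the last key seen when no explicit marker is returned.
    marker = resp.next_marker || resp.contents.last.key
  end
  keys
end

objects = [FakeObject.new('a.txt', 0), FakeObject.new('b.txt', 12),
           FakeObject.new('c.txt', 0), FakeObject.new('d.txt', 7),
           FakeObject.new('e.txt', 0)]
client = FakeClient.new(objects, 2)
puts empty_keys(client, 'example-bucket').inspect
puts client.requests
```

With five objects and a page size of two, the fake client is called three times and no object body is ever read; only the key and size from each listing entry are inspected.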
UPDATE:
The v2 SDK now supports this with a resource-oriented interface. Here is the same example using aws-sdk-resources:
require 'aws-sdk' # must be v2 sdk

empty_files = 0
s3 = Aws::S3::Resource.new
s3.bucket('aws-sdk').objects.each do |obj|
  if obj.size == 0
    empty_files += 1
    puts obj.key
  end
end
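The question also tracked a total object count alongside the empty ones. As a minimal sketch, both tallies can be gathered in one pass over anything that yields items responding to key and size. The Summary struct and the tally_empty helper are hypothetical names made up for this illustration; with the real SDK you would presumably pass the bucket's objects collection instead of the sample array.

```ruby
# Hypothetical stand-in for a listing entry; only #key and #size are used.
Summary = Struct.new(:key, :size)

# Count all objects and collect the keys of zero-byte ones in a single pass.
def tally_empty(objects)
  total = 0
  empty_keys = []
  objects.each do |obj|
    total += 1
    empty_keys << obj.key if obj.size.zero?
  end
  { total: total, empty: empty_keys.size, empty_keys: empty_keys }
end

sample = [Summary.new('logs/a.log', 0),
          Summary.new('logs/b.log', 2048),
          Summary.new('tmp/c', 0)]
result = tally_empty(sample)
puts "#{result[:empty]} of #{result[:total]} objects are empty"
result[:empty_keys].each { |key| puts key }
```

Because the helper only calls each, it streams through the listing rather than building the full list of objects in memory, which matters for a very large bucket.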
Upvotes: 2