RTF

Reputation: 6524

Does the AWS SDK for Ruby download S3 objects during bucket enumeration?

When using the AWS SDK for Ruby with S3, I need to enumerate ALL the files in a huge bucket in order to identify any empty files, i.e. those where obj.content_length == 0.

I've written a script to do that like this:

total_objs = 0
empty_files = 0

bucket.objects.each do |obj|
  total_objs += 1

  if obj.content_length == 0
    empty_files += 1
    puts obj.key
  end
end

...but I'm concerned that this will result in each file being downloaded to determine the file size. Does the SDK actually download the file to know the size, or is it just metadata that gets pulled and then the object gets downloaded lazily if the appropriate method is called?

Also, is there a more efficient way to achieve what I'm trying to do?

Upvotes: 1

Views: 1506

Answers (1)

Trevor Rowe

Reputation: 6528

The easiest way to get what you want is to use the v2 AWS SDK for Ruby, available as aws-sdk-core:

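require 'aws-sdk-core'

empty_files = 0

# Region and credentials are taken from the environment or shared config.
s3 = Aws::S3::Client.new

# Each yielded response is one page of the listing.
s3.list_objects(bucket: 'aws-sdk').each do |resp|
  resp.contents.each do |obj|
    # Listing entries expose the byte count as #size (there is no #content_length here).
    if obj.size == 0
      empty_files += 1
      puts obj.key
    end
  end
end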

The code above makes exactly one request per 1,000 objects, because S3 returns information about at most 1,000 objects per response. It uses the SDK's built-in client response paging to keep calling #list_objects until the bucket is exhausted. Listing returns only metadata such as the key and size; it does not download the object bodies. To fetch a body, call Aws::S3::Client#get_object.
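To make the paging arithmetic concrete, here is a minimal pure-Ruby sketch that needs no AWS account or gem. FakeListing, ListedObject, Page, and empty_keys are hypothetical stand-ins invented for illustration, not part of the SDK. The fake returns at most 1,000 metadata entries per call, and the loop keeps asking for the next page until none remains, which is roughly what the SDK's pager does for you.

```ruby
# Hypothetical stand-ins for illustration only; none of this is the real SDK.
ListedObject = Struct.new(:key, :size)
Page = Struct.new(:contents, :next_offset)

class FakeListing
  PAGE_SIZE = 1000 # mirrors S3's limit of 1,000 entries per list response

  attr_reader :requests

  def initialize(objects)
    @objects = objects
    @requests = 0
  end

  # Returns one page of metadata starting at +offset+; no bodies are involved.
  def list_objects(offset: 0)
    @requests += 1
    slice = @objects[offset, PAGE_SIZE] || []
    next_offset = offset + slice.size
    Page.new(slice, next_offset < @objects.size ? next_offset : nil)
  end
end

# Walks every page and collects the keys of zero-byte objects.
def empty_keys(listing)
  keys = []
  offset = 0
  while offset
    page = listing.list_objects(offset: offset)
    page.contents.each { |obj| keys << obj.key if obj.size.zero? }
    offset = page.next_offset
  end
  keys
end

# 2,500 fake objects, of which every 500th is empty.
objects = Array.new(2500) do |i|
  ListedObject.new(format('file-%04d', i), (i % 500).zero? ? 0 : 1)
end

listing = FakeListing.new(objects)
empties = empty_keys(listing)
puts "#{empties.size} empty objects found in #{listing.requests} list requests"
```

With 2,500 objects the sketch issues three list calls (1,000 + 1,000 + 500), which matches the one-request-per-1k behaviour described above.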

UPDATE:

The v2 SDK now supports this with a resource-oriented interface. Here is the same example using aws-sdk-resources:

require 'aws-sdk' # must be v2 sdk

empty_files = 0

s3 = Aws::S3::Resource.new
s3.bucket('aws-sdk').objects.each do |obj|
  if obj.size == 0
    empty_files += 1
    puts obj.key
  end
end

Upvotes: 2
