zasmail

Reputation: 91

Stream a large file line by line from S3

I have a few 10 GB CSV files in S3 that I'd like to use to seed my DB. I'm running a RoR application on Heroku, and I can't figure out how to stream the CSV line by line to process it: it's far too large to fit in memory, and I can't use File.open to access an external file.

I've looked into using Tempfile to stream a chunk of bytes at a time, but the chunk boundaries don't line up with newlines, and reconstructing the rows in Ruby is difficult.

Thank you!

Upvotes: 7

Views: 4702

Answers (3)

juliangonzalez

Reputation: 4381

For V2 of the AWS SDK for Ruby (Aws::S3):

require 'aws-sdk' # V2 of the AWS SDK for Ruby

# Each chunk is yielded to the block as it arrives, so the whole object
# is never held in memory.
s3 = Aws::S3::Client.new
File.open('filename', 'wb') do |file|
  s3.get_object(bucket: 'bucket-name', key: 'object-key') do |chunk|
    file.write(chunk)
  end
end
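
The block form above streams chunks to disk. If the goal is instead to process CSV rows as they arrive, a minimal sketch (same placeholder bucket/key; process_row is a hypothetical method standing in for whatever inserts one row into your DB) that buffers chunks and splits them on newlines could look like this:

require 'aws-sdk' # V2

s3 = Aws::S3::Client.new
buffer = ''

s3.get_object(bucket: 'bucket-name', key: 'object-key') do |chunk|
  buffer << chunk
  # A chunk rarely ends exactly on a newline, so keep the trailing partial
  # line in the buffer until the next chunk arrives.
  while (newline = buffer.index("\n"))
    line = buffer.slice!(0..newline)
    process_row(line.chomp)
  end
end

process_row(buffer.chomp) unless buffer.empty? # last line may lack a trailing newline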

Upvotes: 0

David Aldridge

Reputation: 52376

You can read the object as a stream, as described in the API documentation: http://docs.aws.amazon.com/AWSRubySDK/latest/AWS/S3/S3Object.html

s3 = AWS::S3.new
large_object = s3.buckets['my-bucket'].objects['key'] # no request made

File.open('output', 'wb') do |file|
  # Each chunk is yielded to the block as it is read from S3.
  large_object.read do |chunk|
    file.write(chunk)
  end
end

You can also pass :range as an option to read only a specific range of bytes.

http://docs.aws.amazon.com/AWSRubySDK/latest/AWS/S3/S3Object.html#read-instance_method
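
For example, a rough sketch (assuming the V1 :range option accepts a Ruby Range of byte offsets, per the docs above) that reads the object in fixed-size windows:

require 'aws-sdk' # V1

s3 = AWS::S3.new
large_object = s3.buckets['my-bucket'].objects['key']

chunk_size = 5 * 1024 * 1024 # 5 MB per request; adjust to taste
total = large_object.content_length
offset = 0

while offset < total
  last_byte = [offset + chunk_size, total].min - 1
  chunk = large_object.read(range: offset..last_byte)
  # Range boundaries will not line up with CSV rows, so a line buffer is
  # still needed before parsing the chunk.
  offset = last_byte + 1
end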

Upvotes: 2

Imran Ahmad

Reputation: 2927

@David Please note: when using blocks to download objects, the Ruby SDK will NOT retry failed requests after the first chunk of data has been yielded, because retrying could corrupt the file on the client end by starting over mid-stream.

When downloading large objects from Amazon S3, you typically want to stream the object directly to a file on disk. This avoids loading the entire object into memory. You can specify the :target for any AWS operation as an IO object.

require 'aws-sdk-s3' # or 'aws-sdk' on V2

s3 = Aws::S3::Client.new
File.open('filename', 'wb') do |file|
  # :target streams the response body directly into the open file handle
  resp = s3.get_object({ bucket: 'bucket-name', key: 'object-key' }, target: file)
end

Here is the official link.
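
Once the object is on disk, the original goal of seeding the DB without loading 10 GB into memory can be handled with CSV.foreach, which reads one row at a time (SomeModel is a placeholder for the actual model):

require 'csv'

CSV.foreach('filename', headers: true) do |row|
  SomeModel.create!(row.to_h)
end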

Upvotes: 1
