Reputation: 91
I have a few 10 GB CSV files in S3 that I'd like to use to seed my DB. I'm running a RoR application on Heroku, and I can't figure out how to stream the CSV line by line to process it, as it's far too large to fit in memory and I can't use File.open to access an external file.
I've looked into using Tempfile to stream chunks of bytes at a time, but the chunks don't line up with newlines, and reconstructing the lines in Ruby is difficult.
Thank you!
Upvotes: 7
Views: 4702
Reputation: 4381
For version 2 of the AWS SDK for Ruby (Aws::S3):
s3 = Aws::S3::Client.new

File.open('filename', 'wb') do |file|
  # Chunks of the object body are yielded to the block as they are downloaded.
  s3.get_object(bucket: 'bucket-name', key: 'object-key') do |chunk|
    file.write(chunk)
  end
end
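Since the chunks are raw bytes and won't line up with newlines, you can keep a small buffer and only hand off complete lines as they become available. A minimal sketch of that idea, where 'bucket-name', 'object-key', and the process_line method are placeholders for your own names and CSV/DB logic:

require 'aws-sdk-s3' # require 'aws-sdk' instead if you are on SDK v2

s3 = Aws::S3::Client.new
buffer = ''.b # binary buffer, since chunks arrive as raw bytes

s3.get_object(bucket: 'bucket-name', key: 'object-key') do |chunk|
  buffer << chunk.b
  # Hand off every complete line; the trailing partial line stays in the buffer.
  while (newline = buffer.index("\n"))
    line = buffer.slice!(0..newline)
    process_line(line.chomp) # placeholder; force_encoding('UTF-8') before parsing if needed
  end
end

process_line(buffer.chomp) unless buffer.empty? # last line if the file has no trailing newline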
Upvotes: 0
Reputation: 52376
You can read a stream, as described in the API documentation: http://docs.aws.amazon.com/AWSRubySDK/latest/AWS/S3/S3Object.html
s3 = AWS::S3.new
large_object = s3.buckets['my-bucket'].objects['key'] # no request made yet

File.open('output', 'wb') do |file|
  # The object body is streamed to the block in chunks.
  large_object.read do |chunk|
    file.write(chunk)
  end
end
You can also pass :range as an option to read a specific range of bytes:
http://docs.aws.amazon.com/AWSRubySDK/latest/AWS/S3/S3Object.html#read-instance_method
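For instance, a ranged read might look like the sketch below. The bucket and key are placeholders, and a Ruby Range of byte offsets is assumed here; the exact forms :range accepts are documented at the link above.

s3 = AWS::S3.new
object = s3.buckets['my-bucket'].objects['key']

# Read just the first megabyte of the object (byte offsets 0 through 1,048,575).
first_part = object.read(range: 0..(1024 * 1024 - 1))

This is useful if you only need part of the file, or if you want to page through it in fixed-size pieces.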
Upvotes: 2
Reputation: 2927
@David Please note that when using blocks to download objects, the Ruby SDK will NOT retry failed requests after the first chunk of data has been yielded. Doing so could cause file corruption on the client end by starting over mid-stream.
When downloading large objects from Amazon S3, you typically want to stream the object directly to a file on disk. This avoids loading the entire object into memory. You can specify the :target for any AWS operation as an IO object.
File.open('filename', 'wb') do |file|
  # The response body is written straight to the file instead of being buffered in memory.
  resp = s3.get_object({ bucket: 'bucket-name', key: 'object-key' }, target: file)
end
Here is the official link.
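Once the object is on disk, the standard library's CSV.foreach reads one row at a time, so a 10 GB file never has to fit in memory. A rough sketch of the seeding step, where the Record model and the attribute mapping are placeholders:

require 'csv'

# Stream the downloaded file row by row instead of loading it all at once.
CSV.foreach('filename', headers: true) do |row|
  Record.create!(row.to_h) # Record and the column-to-attribute mapping are placeholders
end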
Upvotes: 1