Evan Zamir

Reputation: 8461

Reading in gzipped data from S3 in Ruby

My company has data messages (JSON) stored in gzipped files on Amazon S3. I want to use Ruby to iterate through the files and do some analytics. I started to use the 'aws/s3' gem, and I can get each file as an object:

#<AWS::S3::S3Object:0x4xxx4760 '/my.company.archive/data/msg/20131030093336.json.gz'> 

But once I have this object, I do not know how to unzip it or even access the data inside of it.

Upvotes: 2

Views: 3247

Answers (3)

Nivetha R

Reputation: 51

The following steps worked for me:

  1. Download the csv.gz object from S3 with the client and write it to a local file.
  2. Open the local csv.gz file with Zlib::GzipReader and read the CSV rows from it.

require "zlib"

file_path = "/tmp/gz/x.csv.gz"
# Binary mode, so the compressed bytes are written unchanged.
File.open(file_path, "wb") do |f|
  # With a block, get_object yields the body in chunks.
  s3_client.get_object(bucket: bucket, key: key) do |chunk|
    f.write chunk
  end
end

data = []
Zlib::GzipReader.open(file_path) do |gz_reader|
  csv_reader = ::FastestCSV.new(gz_reader)
  csv_reader.each do |csv|
    data << csv
  end
end
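
The local half of this approach (step 2) can be tried without S3 access. The sketch below is only an illustration: read_gzipped_rows is a hypothetical helper, the sample file stands in for the one downloaded from S3, and a naive comma split stands in for FastestCSV, so it would not handle quoted fields.

```ruby
require "zlib"
require "tempfile"

# Hypothetical helper: read a gzipped CSV file into an array of rows.
# The naive comma split is an assumption for illustration only; a real
# CSV parser is needed for quoted or escaped fields.
def read_gzipped_rows(path)
  rows = []
  Zlib::GzipReader.open(path) do |gz|
    gz.each_line { |line| rows << line.chomp.split(",") }
  end
  rows
end

# Stand-in for the file that would have been downloaded from S3.
sample = Tempfile.new(["sample", ".csv.gz"])
sample.close
Zlib::GzipWriter.open(sample.path) do |gz|
  gz.write("id,name\n1,alpha\n2,beta\n")
end

rows = read_gzipped_rows(sample.path)
rows.each { |row| puts row.inspect }
```

Reading line by line keeps memory use low for large files, since only the decompressed line currently being processed is held at once.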

Upvotes: 1

matrik

Reputation: 114

The S3Object documentation has been updated, and the stream method is no longer available: https://docs.aws.amazon.com/AWSRubySDK/latest/AWS/S3/S3Object.html

So, the best way to read data from an S3 object would be this:

json_data = Zlib::GzipReader.new(StringIO.new(your_object.read)).read
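
To see the decompression step in isolation, here is a minimal sketch that runs without S3. It assumes your_object.read returns the compressed bytes as a String; the gunzip helper and the sample payload are made up for illustration.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: decompress a gzip-compressed String in memory.
# GzipReader expects an IO-like object, hence the StringIO wrapper.
def gunzip(bytes)
  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

# Stand-in for the bytes that your_object.read is assumed to return.
compressed = Zlib.gzip('{"id": 1, "msg": "hello"}')

json_data = gunzip(compressed)
puts json_data
```

Note that this holds both the compressed and the decompressed data in memory, which is fine for small message files but may not suit very large objects.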

Upvotes: 0

struthersneil

Reputation: 2750

You can see the documentation for S3Object here: http://amazon.rubyforge.org/doc/classes/AWS/S3/S3Object.html.

You can fetch the content by calling your_object.value; see if you can get that far. Then it should be a question of unpacking the gzip blob. Zlib should be able to handle that.

I'm not sure whether .value returns a big string of binary data or an IO object. If it's a string, you can wrap it in a StringIO object to pass it to Zlib::GzipReader.new, e.g.

json_data = Zlib::GzipReader.new(StringIO.new(your_object.value)).read  

S3Object has a stream method, which I would hope behaves like an IO object (I can't test that here, sorry). If so, you could do this:

json_data = Zlib::GzipReader.new(your_object.stream).read 

Once you have the unzipped json content, you can just call JSON.parse on it, e.g.

JSON.parse Zlib::GzipReader.new(StringIO.new(your_object.value)).read
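
As a rough sketch of that last step, the helpers below parse gzipped JSON held in a String. Both helper names are hypothetical, and the second one assumes a layout the question does not confirm: one JSON message per line in each file.

```ruby
require "json"
require "zlib"
require "stringio"

# Hypothetical helper: the whole file is a single JSON document.
def parse_gzipped_json(bytes)
  JSON.parse(Zlib::GzipReader.new(StringIO.new(bytes)).read)
end

# Hypothetical helper: assumes one JSON message per line (an assumption,
# not something stated in the question). Blank lines are skipped.
def parse_gzipped_json_lines(bytes)
  messages = []
  Zlib::GzipReader.new(StringIO.new(bytes)).each_line do |line|
    messages << JSON.parse(line) unless line.strip.empty?
  end
  messages
end

# Stand-ins for the bytes that your_object.value is assumed to return.
single = Zlib.gzip('{"id": 1}')
multi  = Zlib.gzip(%({"id": 1}\n{"id": 2}\n))

document = parse_gzipped_json(single)
messages = parse_gzipped_json_lines(multi)
puts document["id"]
puts messages.length
```

If the files turn out to hold one large JSON array instead, the first helper is the one to use.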

Upvotes: 1
