Taylor

Reputation: 61

Zlib gunzip only returning partial file

I have a 27MB .gz file (127MB unzipped). Decompressing it with Ruby's Zlib returns correctly formatted data, but the output is truncated to a fraction of the expected size (1,290 rows of data out of 253,000).

string_io = StringIO.new(body)
file = File.new("test.json.gz", "w+")
file.puts string_io.read
file.close

# string_io.read.length == 26_675_650
# File.size("test.json.gz") == 27_738_775

Using GzipReader:

data = ""
File.open(file.path) do |f|
  gz = Zlib::GzipReader.new(f)
  data << gz.read
  gz.close
end
# data.length == 603_537

Using a different GzipReader method:

data = ""
Zlib::GzipReader.open(file.path) do |gz|
  data << gz.read
end
# data.length == 603_537

Using gunzip:

gz = Zlib.gunzip(string_io.read)
# gz.length == 603_537

The expected size is 127,604,690 but I'm only able to extract 603,537. Using gunzip in my terminal correctly extracts the entire file but I'm looking for a programmatic way to handle this.

Upvotes: 1

Views: 323

Answers (2)

Eric Herot

Reputation: 386

There is another possibility for why you might be having problems here...

Some GZip data stored by AWS in S3 is in a format designed to be streamed a little at a time. As a result, the file actually consists of multiple GZip chunks (members) concatenated together. This confuses Zlib::GzipReader, which expects to read a single GZip stream from start to finish, with one header and one footer, and so stops after the first chunk (the issue is described here).
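To make the failure mode concrete, here is a minimal sketch that reproduces it in memory. The two-member stream is built by hand purely for illustration (it is not real S3 data), and the variable names are my own:

```ruby
require "zlib"
require "stringio"

# Two independent gzip members back to back, standing in for a
# "streamed" file made of several concatenated GZip chunks.
multi = Zlib.gzip("first member\n") + Zlib.gzip("second member\n")

# A plain GzipReader#read stops at the end of the first member,
# so everything after it is silently dropped.
first_only = Zlib::GzipReader.new(StringIO.new(multi)).read

# GzipReader.zcat keeps decoding until the input is exhausted.
all_members = Zlib::GzipReader.zcat(StringIO.new(multi))

puts first_only.bytesize
puts all_members.bytesize
```

If the real file behaves the same way, the 603,537 bytes in the question would simply be the first member of the stream.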

Thankfully they've added a relatively easy workaround.

Instead of this:

data = ""
File.open(file.path) do |f|
  gz = Zlib::GzipReader.new(f)
  data << gz.read
  gz.close
end

Do this:

data = File.open('tmpfile.gz') { |f| Zlib::GzipReader.zcat(f) }
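As far as I can tell, Zlib::GzipReader.zcat is only available on Ruby 2.7 and newer. On older versions, a manual loop over the members should do roughly the same thing. This is a sketch rather than a drop-in replacement: read_all_members is a hypothetical helper name, and it assumes the IO is seekable and contains nothing but gzip members.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: decode every gzip member in a seekable IO.
def read_all_members(io)
  out = +""
  loop do
    gz = Zlib::GzipReader.new(io)
    out << gz.read
    # Bytes the reader pulled from io but did not need for this member.
    unused = gz.unused
    # Release the reader without closing the underlying io.
    gz.finish
    # Rewind so the next reader starts at the next member's header.
    io.pos -= unused.bytesize if unused
    break if io.eof?
  end
  out
end

# Hand-built two-member stream, for illustration only.
sample = Zlib.gzip("a" * 10) + Zlib.gzip("b" * 5)
result = read_all_members(StringIO.new(sample))
puts result.bytesize
```

With a file on disk, the same helper could be called as File.open(path, "rb") { |f| read_all_members(f) }.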

Upvotes: 0

Derek Wright

Reputation: 1492

Instead of opening a file and passing a file handle, have you tried using Zlib::GzipReader.open()? It's documented here: https://ruby-doc.org/stdlib/libdoc/zlib/rdoc/Zlib/GzipReader.html

I tested locally and was able to get proper results:

data = ''
=> ""

Zlib::GzipReader.open('file.tar.gz') { |gz|
  data << gz.read
}

data.length
=> 750003

Then checked the file size uncompressed:

gzip -l file.tar.gz
  compressed uncompressed  ratio uncompressed_name
      315581       754176  58.1% file.tar

Edit: Saw your update that you are pulling the data via the S3 API. Make sure you are Base64-decoding your body before writing it to a file.
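A rough sketch of that step, assuming the body really does arrive as Base64 text (the question doesn't confirm this). The encoded_body value is a hypothetical stand-in built locally, and unpack1("m") does the same job as Base64.decode64 without needing an extra require:

```ruby
require "zlib"
require "tempfile"

original = "row\n" * 1_000

# Hypothetical stand-in for a Base64-encoded response body.
encoded_body = [Zlib.gzip(original)].pack("m")

# Decode the Base64 text back into raw gzip bytes.
raw_gzip = encoded_body.unpack1("m")

decoded = Tempfile.create(["body", ".gz"]) do |tmp|
  # Write in binary mode, and use write rather than puts so that
  # no newline or encoding conversion touches the bytes.
  tmp.binmode
  tmp.write(raw_gzip)
  tmp.flush
  Zlib::GzipReader.open(tmp.path) { |gz| gz.read }
end

puts decoded.bytesize
```

Whether or not the body is Base64-encoded, writing it in binary mode with write instead of puts avoids adding a trailing newline to the .gz file.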

Upvotes: 2
