Tihom

Reputation: 3394

Ruby parsing gzip binary string

I have a binary string that holds two gzip binaries concatenated. (I am reading a binary log file that concatenates two gzip files together.)

In other words, I have the equivalent of:

require 'zlib'
require 'stringio'

File.open('t1.gz', 'wb') do |f|
  gz = Zlib::GzipWriter.new(f)
  gz.write 'part one'
  gz.close
end

File.open('t2.gz', 'wb') do |f|
  gz = Zlib::GzipWriter.new(f)
  gz.write 'part 2'
  gz.close
end


contents1 = File.open('t1.gz', "rb") {|io| io.read }
contents2 = File.open('t2.gz', "rb") {|io| io.read }

c = contents1 + contents2

gz = Zlib::GzipReader.new(StringIO.new(c))

gz.each do | l |
    puts l
end

When I try to unzip the combined string, I only get the first string. How do I get both strings?

Upvotes: 4

Views: 2795

Answers (4)

Holger Just

Reputation: 55758

The gzip format uses a footer which contains checksums for the previously compressed data. Once the footer is reached, there can't be any more data in the same gzipped data stream.
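
For illustration, here is a minimal sketch (assuming the t1.gz file written in the question's example) that unpacks this 8-byte footer; it holds the CRC-32 of the uncompressed data and its length modulo 2**32, both little-endian:

require 'zlib'

data = File.open('t1.gz', 'rb') { |io| io.read }
# Last 8 bytes of a gzip member: CRC-32, then uncompressed length mod 2**32.
crc, isize = data[-8, 8].unpack('VV')
puts format('footer crc32: %08x, size: %d', crc, isize)
puts format('crc32 of "part one": %08x', Zlib.crc32('part one'))  # matches the footer value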

It seems the Ruby Gzip reader just finishes reading after the first footer it encounters, which is technically correct, although many other implementations raise an error if there is still more data. I don't really know about the exact behavior of Ruby here.

The point is, you can't just concatenate the raw byte streams and expect things to work. You have to actually adapt the streams and rewrite the headers and footers. See this question for details.

Or you could uncompress the streams, concatenate them, and re-compress the result, but that obviously creates some overhead...
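
A rough sketch of that second approach, reusing the contents1 and contents2 strings from the question's example:

require 'zlib'
require 'stringio'

# Unpack each gzip member on its own, then compress the joined payload once.
part1 = Zlib::GzipReader.new(StringIO.new(contents1)).read
part2 = Zlib::GzipReader.new(StringIO.new(contents2)).read

merged = StringIO.new
gz = Zlib::GzipWriter.new(merged)
gz.write(part1 + part2)
gz.finish  # finish the gzip stream without closing the underlying StringIO

File.open('merged.gz', 'wb') { |f| f.write(merged.string) }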

Upvotes: 1

monde

Reputation: 91

This is the correct way to ensure the whole file is read. Even if unused is nil, that doesn't mean the end of the original gzipped file has been reached.

require 'zlib'

File.open(path_to_file, 'rb') do |file|
  loop do
    # Start a reader at the file's current position (the next gzip member).
    gz = Zlib::GzipReader.new file
    puts gz.read

    # Save the bytes the reader consumed beyond this member before releasing it.
    unused = gz.unused
    gz.finish

    # Rewind to the start of the next member; stop once the file is exhausted.
    adjust = unused.nil? ? 0 : unused.length
    file.pos -= adjust
    break if file.pos == file.size
  end
end

Upvotes: 0

Kelvin

Reputation: 20857

The accepted answer didn't work for me. Here's my modified version. Notice the different usage of gz.unused.

Also, you should call finish on the GzipReader instance to avoid memory leaks.

# gzcat-test.rb
require 'zlib'
require 'stringio'
require 'digest/sha1'

# gzip -c /usr/share/dict/web2 /usr/share/dict/web2a > web-cat.gz
io = File.open('web-cat.gz', 'rb')
# or, if you don't care about memory usage:
# io = StringIO.new File.binread 'web-cat.gz'

# these will be hashes: {orig_name: 'filename', data_arr: unpacked_lines}
entries=[]
loop do
  entries << {data_arr: []}
  # create a reader starting at io's current position
  gz = Zlib::GzipReader.new(io)
  entries.last[:orig_name] = gz.orig_name
  gz.each {|l| entries.last[:data_arr] << l }
  unused = gz.unused  # save this before calling #finish
  gz.finish

  if unused
    # Unused is not the entire remainder, but only part of it.
    # We need to back up since we've moved past the start of the next entry.
    io.pos -= unused.size
  else
    break
  end
end

io.close

# verify the data
entries.each do |entry_hash|
  p entry_hash[:orig_name]
  puts Digest::SHA1.hexdigest(entry_hash[:data_arr].join)
end

Run:

> ./gzcat-test.rb
web2"
a62edf8685920f7d5a95113020631cdebd18a185
"web2a"
b0870457df2b8cae06a88657a198d9b52f8e2b0a

Our unpacked contents match the originals:

> shasum /usr/share/dict/web*
a62edf8685920f7d5a95113020631cdebd18a185  /usr/share/dict/web2
b0870457df2b8cae06a88657a198d9b52f8e2b0a  /usr/share/dict/web2a

Upvotes: 0

undur_gongor

Reputation: 15954

# c starts as the combined string from the question
while c
  io = StringIO.new(c)
  gz = Zlib::GzipReader.new(io)
  gz.each do |l|
    puts l
  end
  c = gz.unused   # take the unprocessed portion of the string as the next archive
end

See ruby-doc.

Upvotes: 3
