Reputation: 3394
I have a binary string that holds two gzip streams concatenated (I am reading a binary log file that was produced by concatenating two gzip files together).
In other words, I have the equivalent of:
require 'zlib'
require 'stringio'
File.open('t1.gz', 'w') do |f|
  gz = Zlib::GzipWriter.new(f)
  gz.write 'part one'
  gz.close
end

File.open('t2.gz', 'w') do |f|
  gz = Zlib::GzipWriter.new(f)
  gz.write 'part 2'
  gz.close
end

contents1 = File.open('t1.gz', 'rb') { |io| io.read }
contents2 = File.open('t2.gz', 'rb') { |io| io.read }
c = contents1 + contents2

gz = Zlib::GzipReader.new(StringIO.new(c))
gz.each do |l|
  puts l
end
When I try to unzip the combined string, I only get the first string. How do I get both strings?
Upvotes: 4
Views: 2795
Reputation: 55758
The gzip format uses a footer which contains checksums for the previously compressed data. Once the footer is reached, there can't be any more data in the same gzipped data stream.
It seems the Ruby GzipReader simply finishes reading after the first footer it encounters, which is technically correct, although many other implementations raise an error if there is still more data. I don't really know the exact behavior of Ruby here.
The point is, you can't just concatenate the raw byte streams and expect things to work. You have to actually adapt the streams and rewrite the headers and footers. See this question for details.
Or you could decompress the streams, concatenate the plain data and re-compress it, but that obviously creates some overhead...
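The decompress-and-recompress route could be sketched like this (the in-memory gzip helper and the multi-member loop are illustrative, not from the original post):

```ruby
require 'zlib'
require 'stringio'

# Helper to gzip a string in memory (illustrative name, not from the post).
def gzip(str)
  buf = StringIO.new
  gz = Zlib::GzipWriter.new(buf)
  gz.write str
  gz.close
  buf.string
end

# Two concatenated gzip streams, as in the question.
c = gzip('part one') + gzip('part 2')

# Decompress every member, collecting the plain data...
io = StringIO.new(c)
plain = ''
loop do
  gz = Zlib::GzipReader.new(io)   # reads from io's current position
  plain << gz.read
  unused = gz.unused              # bytes read past this member's footer
  gz.finish
  io.pos -= unused ? unused.size : 0
  break if io.pos == io.size
end

# ...then re-compress the concatenated data as a single stream.
recompressed = gzip(plain)
puts Zlib::GzipReader.new(StringIO.new(recompressed)).read
```

The extra decompression pass is the overhead mentioned above; in exchange, the result is an ordinary single-member gzip stream that any reader handles.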
Upvotes: 1
Reputation: 91
This is the correct way to ensure the whole file is read: even when unused is nil, that does not mean the end of the original gzipped file has been reached.
require 'zlib'

File.open(path_to_file, 'rb') do |file|
  loop do
    # create a reader starting at the file's current position
    gz = Zlib::GzipReader.new file
    puts gz.read
    unused = gz.unused # bytes read past the end of this gzip stream
    gz.finish
    # rewind to the start of the next stream
    adjust = unused.nil? ? 0 : unused.length
    file.pos -= adjust
    break if file.pos == file.size
  end
end
Upvotes: 0
Reputation: 20857
The accepted answer didn't work for me. Here's my modified version; notice the different usage of gz.unused. Also, you should call finish on the GzipReader instance to avoid memory leaks.
# gzcat-test.rb
require 'zlib'
require 'stringio'
require 'digest/sha1'
# gzip -c /usr/share/dict/web2 /usr/share/dict/web2a > web-cat.gz
io = File.open('web-cat.gz')
# or, if you don't care about memory usage:
# io = StringIO.new File.read 'web-cat.gz'
# these will be hashes: {orig_name: 'filename', data_arr: unpacked_lines}
entries = []
loop do
  entries << {data_arr: []}
  # create a reader starting at io's current position
  gz = Zlib::GzipReader.new(io)
  entries.last[:orig_name] = gz.orig_name
  gz.each { |l| entries.last[:data_arr] << l }
  unused = gz.unused # save this before calling #finish
  gz.finish
  if unused
    # Unused is not the entire remainder, but only part of it.
    # We need to back up since we've moved past the start of the next entry.
    io.pos -= unused.size
  else
    break
  end
end
io.close

# verify the data
entries.each do |entry_hash|
  p entry_hash[:orig_name]
  puts Digest::SHA1.hexdigest(entry_hash[:data_arr].join)
end
Run:
> ./gzcat-test.rb
"web2"
a62edf8685920f7d5a95113020631cdebd18a185
"web2a"
b0870457df2b8cae06a88657a198d9b52f8e2b0a
Our unpacked contents match the originals:
> shasum /usr/share/dict/web*
a62edf8685920f7d5a95113020631cdebd18a185 /usr/share/dict/web2
b0870457df2b8cae06a88657a198d9b52f8e2b0a /usr/share/dict/web2a
Upvotes: 0
Reputation: 15954
while c
  io = StringIO.new(c)
  gz = Zlib::GzipReader.new(io)
  gz.each do |l|
    puts l
  end
  c = gz.unused # take the unprocessed portion of the string as the next archive
end
See ruby-doc.
Upvotes: 3