Kevin Taehyun Kang

Reputation: 13

Append string to an existing gzip file in Ruby

I am trying to read a gzip file and append part of its contents (a string) to another, existing gzip file. The string is about 3000 lines long. I will have to do this many times (~10000 times) in Ruby. What would be the most efficient way of doing this? The zlib library does not support appending, and using backticks (gzip -c orig_gzip >> gzip.gz) seems to be too slow. The resulting file should be one gigantic text file.

Upvotes: 1

Views: 1535

Answers (3)

Edgar Ortega

Reputation: 1735

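You need to open the gzipped file in binary mode (b) and also in append mode (a). In my case it is a gzipped CSV file:

require 'zlib'

# 'a' appends instead of truncating; 'b' keeps the compressed bytes binary-safe
file = File.open('path-to-file.csv.gz', 'ab')
gz = Zlib::GzipWriter.new(file)
gz.write("new,row,csv\n")
gz.close # also closes the underlying file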

If you open the file in w mode, you will overwrite the content of the file. Check the documentation for a full description of the open modes: http://ruby-doc.org/core-2.5.3/IO.html#method-c-new

Upvotes: 0

the Tin Man

Reputation: 160611

It's not clear what you are looking for. If you are trying to join multiple files into one gzip archive, you can't get there. Per the gzip documentation:

Can gzip compress several files into a single archive?

Not directly. You can first create a tar file then compress it:
for GNU tar: gtar cvzf file.tar.gz filenames
for any tar: tar cvf - filenames | gzip > file.tar.gz

Alternatively, you can use zip, PowerArchiver 6.1, 7-zip or Winzip. The zip format allows random access to any file in the archive, but the tar.gz format usually gives a better compression ratio.

Given how many times you will be adding to the archive, it makes more sense to expand the source, append the string to a single uncompressed file, and then compress that file on demand or on a regular cycle.

You will have a large file but the compression time would be fast.
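A minimal sketch of that approach in Ruby, assuming there is enough disk space to hold the accumulated text uncompressed. The helper names append_plain and compress_file and the 64 KB read size are illustrative choices, not part of any library:

```ruby
require 'zlib'

# Hypothetical helper: append text to a plain (uncompressed) accumulator file.
def append_plain(path, text)
  File.open(path, 'a') { |f| f.write(text) }
end

# Hypothetical helper: compress the accumulator into a fresh gzip file,
# streaming it in fixed-size pieces so the whole file is never held in memory.
def compress_file(src, dest)
  Zlib::GzipWriter.open(dest) do |gz|
    File.open(src, 'rb') do |f|
      while (piece = f.read(64 * 1024))
        gz.write(piece)
      end
    end
  end
end
```

Each append is then an ordinary file write, and compress_file only needs to run when a compressed copy is actually required.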


If you want to accumulate data, not separate files, in a gzip file without expanding it all, it is possible to append to an existing gzip file from Ruby. However, you have to specify the "a" ("append") mode when opening your original .gz file. Failing to do that causes your original to be overwritten:

require 'zlib'

File.open('main.gz', 'ab') do |main_gz_io|
  Zlib::GzipWriter.wrap(main_gz_io) do |main_gz|
    5.times do
      print '.'
      main_gz.puts Time.now.to_s
      sleep 1
    end
  end
end
puts 'done'
puts 'viewing output:'
puts '---------------'
puts `gunzip -c main.gz`

Which, when run, outputs:

.....done
viewing output:
---------------
2013-04-10 12:06:34 -0700
2013-04-10 12:06:35 -0700
2013-04-10 12:06:36 -0700
2013-04-10 12:06:37 -0700
2013-04-10 12:06:38 -0700

Run that several times and you'll see the output grow.

Whether this code is fast enough for your needs is hard to say. This example artificially drags its feet to write once a second.
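To read such a file back from Ruby instead of shelling out to gunzip, keep in mind that every append adds a separate gzip member, and a single Zlib::GzipReader only reads one member. The sketch below walks all members using GzipReader#unused to find where the next one starts. The name read_all_members is hypothetical, and the whole decompressed text is assumed to fit in memory:

```ruby
require 'zlib'

# Hypothetical helper: decompress every gzip member in `path` and return
# the concatenated text.
def read_all_members(path)
  parts = []
  File.open(path, 'rb') do |io|
    until io.eof?
      gz = Zlib::GzipReader.new(io)
      parts << gz.read
      # Bytes the reader pulled from the file past the end of this member.
      unused = gz.unused
      gz.finish # release the reader but keep the underlying file open
      # Rewind so the next reader starts at the next member's header.
      io.pos -= unused.length if unused
    end
  end
  parts.join
end
```

For very large files it would be better to process each member's output as it is read rather than collecting everything in an array.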

Upvotes: 4

Mark Adler

Reputation: 112547

It sounds like your appended data is long enough that it would be efficient enough to simply compress the 3000 lines to a gzip stream and append that to the existing gzip stream. gzip has the property that the concatenation of two valid gzip streams is also a valid gzip stream, and that gzip stream decompresses to the concatenation of the decompressions of the two original gzip streams.

I don't understand "(gzip -c orig_gzip >> gzip.gz) seems to be too slow". That would be the fastest way. If you don't like the time spent compressing, you can reduce the compression level, e.g. gzip -1.
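The same concatenation can also be done in-process from Ruby, which avoids starting a gzip process for each of the ~10000 appends. This is only a sketch: append_chunks_as_members is a hypothetical helper name, and Zlib::BEST_SPEED (level 1, comparable to gzip -1) is an assumed default that trades compression ratio for speed:

```ruby
require 'zlib'

# Hypothetical helper: append each chunk to `path` as its own gzip member.
# The file is opened once in binary append mode and reused for every chunk.
def append_chunks_as_members(path, chunks, level: Zlib::BEST_SPEED)
  File.open(path, 'ab') do |io|
    chunks.each do |chunk|
      gz = Zlib::GzipWriter.new(io, level)
      gz.write(chunk)
      # finish writes this member's trailer without closing the file,
      # so the next chunk can start a new member right after it.
      gz.finish
    end
  end
end
```

Passing an enumerator that yields the ~3000-line strings one at a time keeps a single file handle open for the whole run. The result can be checked with gunzip -c, which decompresses all members in order.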

The zlib library actually supports quite a bit, when the low-level functions are used. You can see advanced examples of gzip appending in the examples/ directory of the zlib distribution. You can look at gzappend.c, which appends more efficiently, in terms of compression, than a simple concatenation, by first decompressing the existing gzip stream and picking up compression where the previous stream left off. gzlog.h and gzlog.c provide an efficient and robust way to append short messages to a gzip stream.

Upvotes: 2
