zulqarnain

Reputation: 1735

Ruby create tar ball in chunks to avoid out of memory error

I'm trying to re-use the following code to create a tar ball:

tarfile = File.open("#{Pathname.new(path).realpath.to_s}.tar","w")
Gem::Package::TarWriter.new(tarfile) do |tar|
  Dir[File.join(path, "**/*")].each do |file|
    mode = File.stat(file).mode
    relative_file = file.sub /^#{Regexp::escape path}\/?/, ''
    if File.directory?(file)
      tar.mkdir relative_file, mode
    else
      tar.add_file relative_file, mode do |tf|
        File.open(file, "rb") { |f| tf.write f.read }
      end
    end
  end
end
tarfile.rewind
tarfile

It works fine as long as only small folders are involved, but anything large fails with the following error:

Error: Your application used more memory than the safety cap

How can I do it in chunks to avoid the memory problems?

Upvotes: 2

Views: 1248

Answers (2)

the Tin Man

Reputation: 160571

It looks like the problem could be in this line:

File.open(file, "rb") { |f| tf.write f.read }

You are "slurping" your input file by calling f.read. Slurping means the entire file is read into memory, which doesn't scale, and it is what happens when you call read without a length.

Instead, I'd read and write the file in blocks so memory usage stays consistent. This reads in blocks of roughly 1MB (1,024,000 bytes). You can adjust that for your own needs:

BLOCKSIZE_TO_READ = 1024 * 1000

File.open(file, "rb") do |fi|
  while buffer = fi.read(BLOCKSIZE_TO_READ)
    tf.write buffer
  end
end

Here's what the documentation says about read:

If length is a positive integer, it try to read length bytes without any conversion (binary mode). It returns nil or a string whose length is 1 to length bytes. nil means it met EOF at beginning. The 1 to length-1 bytes string means it met EOF after reading the result. The length bytes string means it doesn’t meet EOF. The resulted string is always ASCII-8BIT encoding.
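
To see those return values in practice, here is a small sketch that reads a 2,500-byte temporary file in 1,000-byte blocks. The BLOCK constant, the sizes and the temp file prefix are made up for illustration and are deliberately small:

```ruby
require 'tempfile'

BLOCK = 1000 # hypothetical block size, small so the result is easy to follow

sizes = []
final_read = :not_reached

Tempfile.create('read-demo') do |tmp|
  tmp.binmode
  tmp.write('x' * 2500) # 2,500 bytes of sample data
  tmp.rewind

  # read(BLOCK) returns a String of 1..BLOCK bytes, or nil once EOF has been
  # reached, and that nil is what ends the loop.
  while (buffer = tmp.read(BLOCK))
    sizes << buffer.bytesize
  end

  final_read = tmp.read(BLOCK) # reading again at EOF still gives nil
end

p sizes      # two full blocks followed by one short block
p final_read
```

At no point does the loop hold more than one block in memory, which is the whole point of passing a length.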

An additional problem is that you don't appear to be opening the output file correctly:

tarfile = File.open("#{Pathname.new(path).realpath.to_s}.tar","w")

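You're writing it in "text" mode because of "w". Instead, you need to write in binary mode, "wb", because tarballs contain binary data: tar headers plus the raw bytes of every file, even though a plain .tar is not compressed. In text mode, line endings can be translated on Windows, which would corrupt the archive: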

tarfile = File.open("#{Pathname.new(path).realpath.to_s}.tar","wb")
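
If you want to check which mode a handle is in, IO#binmode? will tell you. A rough sketch, using a throwaway temp directory and file names that are purely illustrative; on Unix-like systems the bytes written are the same either way, so the difference mostly shows up on Windows:

```ruby
require 'tmpdir'

text_mode = nil
binary_mode = nil

Dir.mktmpdir do |dir|
  # "w" opens the file in text mode.
  File.open(File.join(dir, 'text.tar'), 'w') { |f| text_mode = f.binmode? }

  # "wb" opens it in binary mode, with no line-ending translation.
  File.open(File.join(dir, 'binary.tar'), 'wb') { |f| binary_mode = f.binmode? }
end

puts "w  -> binmode? #{text_mode}"
puts "wb -> binmode? #{binary_mode}"
```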

Rewriting the original code to be more like I'd want to see it, results in:

require 'pathname'
require 'rubygems/package'

BLOCKSIZE_TO_READ = 1024 * 1000

def create_tarball(path)

  tar_filename = Pathname.new(path).realpath.to_path + '.tar'

  File.open(tar_filename, 'wb') do |tarfile|

    Gem::Package::TarWriter.new(tarfile) do |tar|

      Dir[File.join(path, '**/*')].each do |file|

        mode = File.stat(file).mode
        relative_file = file.sub(/^#{ Regexp.escape(path) }\/?/, '')

        if File.directory?(file)
          tar.mkdir(relative_file, mode)
        else

          tar.add_file(relative_file, mode) do |tf|
            File.open(file, 'rb') do |f|
              while buffer = f.read(BLOCKSIZE_TO_READ)
                tf.write buffer
              end
            end
          end

        end
      end
    end
  end

  tar_filename

end

BLOCKSIZE_TO_READ should be at the top of your file since it's a constant and is a "tweakable" - something more likely to be changed than the body of the code.

The method returns the path to the tarball, not an IO handle like the original code. The block form of File.open closes the output automatically, so there is no handle left to rewind; anyone who opens the path later starts reading at the beginning. I much prefer passing around path strings to passing IO handles for files.
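
As a rough illustration of working from a path rather than a handle, the sketch below writes a tiny archive, lets the block form of File.open close it, then reopens the same path and lists the entries with Gem::Package::TarReader. The demo.tar name and the docs/hello.txt entry are invented for the example:

```ruby
require 'rubygems/package'
require 'tmpdir'

entries = []

Dir.mktmpdir do |dir|
  tar_path = File.join(dir, 'demo.tar') # hypothetical path for the demo

  # Write a tiny archive; the block form of File.open closes the handle.
  File.open(tar_path, 'wb') do |io|
    Gem::Package::TarWriter.new(io) do |tar|
      tar.mkdir('docs', 0o755)
      tar.add_file('docs/hello.txt', 0o644) { |tf| tf.write("hello\n") }
    end
  end

  # A caller who only has the path opens it again and starts at the beginning,
  # so no rewind is needed.
  File.open(tar_path, 'rb') do |io|
    Gem::Package::TarReader.new(io) do |tar|
      tar.each { |entry| entries << [entry.full_name, entry.directory?] }
    end
  end
end

p entries
```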

I also wrapped some of the method parameters in parentheses. While parentheses aren't required around method parameters in Ruby, and some people eschew them, I think they make the code more maintainable by delimiting where the parameters start and end. They also avoid confusing Ruby when you're passing parameters and a block to a method, a well-known cause of bugs.
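
Here's a small sketch of the kind of confusion I mean. got_block? is a hypothetical helper that only reports whether it was handed a block; the point is that braces and do...end attach to different calls when the parentheses are left out:

```ruby
# Hypothetical helper: reports whether a block was passed to it.
def got_block?(_arg)
  block_given?
end

numbers = [1, 2, 3]

# Braces bind to the nearest call (numbers.map), so got_block? sees no block.
with_braces = got_block? numbers.map { |n| n * 2 }

# do...end binds to the outermost call (got_block?), so map runs without a
# block and the helper receives the block instead.
with_do_end = got_block? numbers.map do |n|
  n * 2
end

# Parentheses remove the ambiguity: the block clearly belongs to map.
explicit = got_block?(numbers.map { |n| n * 2 })

p with_braces
p with_do_end
p explicit
```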

Upvotes: 3

James

Reputation: 4737

minitar looks like it writes to a stream so I don't think memory will be a problem. Here is the comment and definition of the pack method (as of May 21, 2013):

# A convenience method to pack files specified by +src+ into +dest+. If
# +src+ is an Array, then each file detailed therein will be packed into
# the resulting Archive::Tar::Minitar::Output stream; if +recurse_dirs+
# is true, then directories will be recursed.
#
# If +src+ is an Array, it will be treated as the argument to Find.find;
# all files matching will be packed.
def pack(src, dest, recurse_dirs = true, &block)
  Output.open(dest) do |outp|
    if src.kind_of?(Array)
      src.each do |entry|
        pack_file(entry, outp, &block)
        if dir?(entry) and recurse_dirs
          Dir["#{entry}/**/**"].each do |ee|
            pack_file(ee, outp, &block)
          end
        end
      end
    else
      Find.find(src) do |entry|
        pack_file(entry, outp, &block)
      end
    end
  end
end

Example from the README to write a tar:

# Packs everything that matches Find.find('tests')
File.open('test.tar', 'wb') { |tar| Minitar.pack('tests', tar) }

Example from the README to write a gzipped tar:

tgz = Zlib::GzipWriter.new(File.open('test.tgz', 'wb'))
# Warning: tgz will be closed!
Minitar.pack('tests', tgz)

Upvotes: 1
