Arian Faurtosh

Reputation: 18521

Trying to unzip a 600 MB tgz with Ruby gives an integer-out-of-range error

I'm trying to untar a tgz file with the following code:

tar_extract.each do |entry|
  entry_filename = File.basename(entry.full_name)
  next if entry.directory?                  # skip directories
  next unless entry.file?                   # skip anything that isn't a regular file
  next if entry.full_name.start_with?('/')  # skip absolute paths

  file_path = File.join(working_directory, entry_filename)
  puts "Writing file: #{file_path}"

  File.open(file_path, 'wb') do |f|
    f.write(entry.read) # reads the whole entry into memory in one call
  end

  bytes = File.size(file_path)

  puts "Successfully wrote file with #{bytes} bytes"
end

tar_extract.close
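
For reference, tar_extract is a Gem::Package::TarReader (as the stack trace below shows); it is set up roughly like this, with the path being a placeholder:

require 'rubygems/package'
require 'zlib'

# Placeholder path; the real archive is the ~600 MB tgz mentioned in the title.
tar_extract = Gem::Package::TarReader.new(Zlib::GzipReader.open('/files/archive.tgz'))
tar_extract.rewind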

This code usually works fine; however, when a file inside the TGZ is too big, I get an integer-out-of-range error.

Writing file: /files/working_dir/test1.tar.gz  
Successfully wrote file with 244704472 bytes 

Writing file: /files/working_dir/test2.sql
RangeError: integer 2556143960 too big to convert to `int'
from /usr/local/rvm/rubies/ruby-2.1.1/lib/ruby/site_ruby/2.1.0/rubygems/package/tar_reader/entry.rb:126:in `read'

I'm not sure what else I should try.

Looking at the RubyGems source, this is the method that raises. With len left as nil it tries to read the entire remaining entry (2556143960 bytes here) in a single @io.read call, and that length does not fit in a 32-bit signed int:

  ##
  # Reads +len+ bytes from the tar file entry, or the rest of the entry if
  # nil

  def read(len = nil)
    check_closed

    return nil if @read >= @header.size

    len ||= @header.size - @read
    max_read = [len, @header.size - @read].min

    ret = @io.read max_read
    @read += ret.size

    ret
  end

Upvotes: 1

Views: 189

Answers (2)

Arian Faurtosh

Reputation: 18521

Using Joe's guidance, I was able to figure it out.

I changed the File block to:

File.open(file_path, 'wb') do |f|
  until entry.eof?
    f.write(entry.read(16000)) # 16 KB
  end
end

The reason I chose 16 KB is that I ran a series of benchmarks:

require 'benchmark'

b = Benchmark.measure do
  File.open(file_path, 'wb') do |f|
    until entry.eof?
      f.write(entry.read(16000)) # 16 KB
    end
  end
end

bytes = File.size(file_path)
puts("Successfully wrote file with #{bytes} bytes in #{b.real}")

After doing some research, it seems each disk has its own optimal chunk size. I benchmarked two files, one of about 211 MB and one of about 6.6 GB. Results are below; 16 KB to 64 KB turned out to be the optimal range for my disk (a sketch of how such a sweep can be scripted follows the results).

2 GB // 2047483648

Successfully wrote file with 7021620216 bytes in 60.360527059

Successfully wrote file with 220613778 bytes in 2.084798686

1 GB // 1073741824

Successfully wrote file with 7021620216 bytes in 42.345642806
Successfully wrote file with 7021620216 bytes in 48.941375145
Successfully wrote file with 7021620216 bytes in 51.501044608
Successfully wrote file with 7021620216 bytes in 58.81474911

Successfully wrote file with 220613778 bytes in 1.57968424
Successfully wrote file with 220613778 bytes in 2.28171993
Successfully wrote file with 220613778 bytes in 5.905203041
Successfully wrote file with 220613778 bytes in 16.944126945

4 KB // 4000

Successfully wrote file with 7021620216 bytes in 43.39409191
Successfully wrote file with 7021620216 bytes in 44.572620161
Successfully wrote file with 7021620216 bytes in 48.510513964
Successfully wrote file with 7021620216 bytes in 53.839022034

Successfully wrote file with 220613778 bytes in 1.982647292
Successfully wrote file with 220613778 bytes in 2.071772595
Successfully wrote file with 220613778 bytes in 2.132004983
Successfully wrote file with 220613778 bytes in 2.221654993

8 KB // 8000

Successfully wrote file with 7021620216 bytes in 41.851550514
Successfully wrote file with 7021620216 bytes in 45.611952667
Successfully wrote file with 7021620216 bytes in 50.068614205
Successfully wrote file with 7021620216 bytes in 50.726276706

Successfully wrote file with 220613778 bytes in 1.941246687
Successfully wrote file with 220613778 bytes in 2.456356439
Successfully wrote file with 220613778 bytes in 2.56323527
Successfully wrote file with 220613778 bytes in 3.756049832

16 KB // 16000

Successfully wrote file with 7021620216 bytes in 36.929413152
Successfully wrote file with 7021620216 bytes in 36.486866289
Successfully wrote file with 7021620216 bytes in 36.743103326
Successfully wrote file with 7021620216 bytes in 37.019910405

Successfully wrote file with 220613778 bytes in 1.504792162
Successfully wrote file with 220613778 bytes in 1.620161067
Successfully wrote file with 220613778 bytes in 1.622070414
Successfully wrote file with 220613778 bytes in 1.698627821


32 KB // 32000

Successfully wrote file with 7021620216 bytes in 35.802759912
Successfully wrote file with 7021620216 bytes in 38.775857377
Successfully wrote file with 7021620216 bytes in 39.116311496
Successfully wrote file with 7021620216 bytes in 39.126005469

Successfully wrote file with 220613778 bytes in 1.696821094
Successfully wrote file with 220613778 bytes in 1.773727215
Successfully wrote file with 220613778 bytes in 4.023144931
Successfully wrote file with 220613778 bytes in 4.08615266


64 KB // 64000

Successfully wrote file with 7021620216 bytes in 36.732343382
Successfully wrote file with 7021620216 bytes in 37.914365658
Successfully wrote file with 7021620216 bytes in 38.336098907
Successfully wrote file with 7021620216 bytes in 39.146334479

Successfully wrote file with 220613778 bytes in 1.662487522
Successfully wrote file with 220613778 bytes in 1.674177939
Successfully wrote file with 220613778 bytes in 1.745556917
Successfully wrote file with 220613778 bytes in 1.784492717
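
The sweep itself can be scripted; here is a rough sketch (tgz_path is a placeholder for the archive path, working_directory is as above, and the archive is reopened for every chunk size so each run starts from a fresh stream):

require 'benchmark'
require 'rubygems/package'
require 'zlib'

CHUNK_SIZES = [4_000, 8_000, 16_000, 32_000, 64_000] # bytes per read

CHUNK_SIZES.each do |chunk_size|
  # Reopen the archive for every chunk size so each run reads from the start.
  tar = Gem::Package::TarReader.new(Zlib::GzipReader.open(tgz_path))
  tar.rewind

  tar.each do |entry|
    next unless entry.file?

    file_path = File.join(working_directory, File.basename(entry.full_name))
    time = Benchmark.measure do
      File.open(file_path, 'wb') do |f|
        f.write(entry.read(chunk_size)) until entry.eof?
      end
    end
    puts "chunk=#{chunk_size}: wrote #{File.size(file_path)} bytes in #{time.real}"
  end

  tar.close
end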

Upvotes: 0

Joe

Reputation: 42646

You can likely fix this by changing this:

  File.open(file_path, 'wb') do |f|
    f.write(entry.read)
  end

to a loop that calls entry.read with a length argument giving the maximum number of bytes to read in that iteration. You may need to split the read and the write into two steps, since entry.read can return nil once there is no more data to process.
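
For example, something like this (a sketch only; the 1 MB chunk size is arbitrary and the surrounding variables are as in your question):

  File.open(file_path, 'wb') do |f|
    # Read in bounded chunks; the loop ends when entry.read returns nil at the end of the entry.
    while (chunk = entry.read(1024 * 1024))
      f.write(chunk)
    end
  end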

Upvotes: 1
