user1277186

Reputation: 115

How to efficiently split a file into arbitrary byte ranges in Ruby

I need to upload a big file to a third-party service. This service gives me a list of URLs and byte ranges:

requests = [
  {url: "https://.../part1", from: 0, to: 20_000_000},
  {url: "https://.../part2", from: 20_000_001, to: 40_000_000},
  {url: "https://.../part3", from: 40_000_001, to: 54_184_279}
]

I'm using the httpx gem to upload the data; its :body option can receive an IO or Enumerable object. I would like to split and upload the chunks efficiently, so I think I should avoid both writing the chunks to disk and loading the entire file into memory. I suppose the best option would be some kind of "lazy Enumerable", but I don't know how to write the part function that would return this IO or Enumerable object.

file = File.open("bigFile", "rb")
results = requests.map do |request|
  Thread.start { HTTPX.post(request[:url], body: part(file, request[:from], request[:to])) }
end.map(&:value)

def part(file, from, to)
   # ???
end

Upvotes: 3

Views: 289

Answers (2)

Fravadona

Reputation: 16980

The easiest way to generate an enumerator for each byte range is to let the part function open the file itself, so it takes a file path instead of an open file:

def part(filepath, from, to = nil, chunk_size = 4096, &block)
  return to_enum(__method__, filepath, from, to, chunk_size) unless block_given?
  file_size = File.size(filepath)
  # Fall back to the last byte of the file when `to` is missing or out of range
  to = file_size - 1 unless to && to >= from && to < file_size
  io = File.open(filepath, "rb")
  io.seek(from, IO::SEEK_SET)
  while io.pos <= to
    # `to` is inclusive, so the last chunk holds the 1 + to - io.pos remaining bytes
    length = (io.pos + chunk_size <= to) ? chunk_size : 1 + to - io.pos
    chunk = io.read(length)
    yield chunk
  end
ensure
  io.close if io
end

Warning: the chunk size calculation may be wrong, I will check it in a while (I have to take care of my child)

Note: You may want to improve this function so that it always reads a full physical HDD block (or a multiple of one), as this can speed up the I/O. Reads will be misaligned whenever from is not a multiple of the physical block size.

The part function now returns an Enumerator when called without a block:

part("bigFile", 0, 1300, 512)
#=> #<Enumerator: main:part("bigFile", 0, 1300, 512)>

And of course you can call it directly with a block:

part("bigFile", 0, 1300, 512) do |chunk|
  puts chunk.inspect
end
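
To check that an enumerator of this kind yields exactly the requested bytes, it can be run against a throwaway file. The sketch below is not the function above but a compact variant written for illustration: the name byte_range_chunks, the 3000-byte sample file and the 100..1399 range are all assumptions, and from/to are treated as inclusive.

```ruby
require "tempfile"

# Hypothetical helper (illustration only): lazily yields the bytes of `path`
# in the inclusive range from..to, at most chunk_size bytes at a time.
def byte_range_chunks(path, from, to, chunk_size = 4096)
  return to_enum(__method__, path, from, to, chunk_size) unless block_given?

  File.open(path, "rb") do |io|
    io.seek(from)
    remaining = to - from + 1            # inclusive range length
    while remaining > 0
      chunk = io.read([chunk_size, remaining].min)
      break if chunk.nil?                # end of file reached before `to`
      remaining -= chunk.bytesize
      yield chunk
    end
  end
end

# Throwaway 3000-byte sample file standing in for "bigFile"
sample = Tempfile.new("bigFile")
sample.binmode
sample.write((0...3000).map { |i| i % 256 }.pack("C*"))
sample.flush

chunks = byte_range_chunks(sample.path, 100, 1399, 512).to_a
chunk_sizes = chunks.map(&:bytesize)     # 1300 bytes split as 512, 512, 276
joined = chunks.join
expected = File.binread(sample.path, 1300, 100)
```

Because every call opens its own file handle, several enumerators can be consumed from different threads without sharing a file position. Only one chunk per enumerator is held in memory at a time.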

Upvotes: 1

steenslag

Reputation: 80065

IO.read("bigFile", 1000, 2000)

will read 1000 bytes, starting at offset 2000. Ruby starts counting at zero, so I think

IO.read("bigFile", 20_000_000, 0) # followed by
IO.read("bigFile", 20_000_000, 20_000_000) # not 20_000_001

would be correct. Without bookkeeping:

File.open("bigFile", "rb") do |f|
  partname = "part0"
  until f.eof?
    partname = partname.succ
    chunk = f.read(20_000_000)

    # do something with chunk and partname
  end
end
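
Whether a part is 20_000_000 or 20_000_001 bytes long depends on whether the service treats to as inclusive, which the question does not say. As a sketch under the assumption that both from and to are inclusive, each range from the question can be converted into the length and offset arguments that IO.read takes. The helper name length_and_offset is hypothetical.

```ruby
# Hypothetical helper: converts an inclusive from..to byte range into the
# [length, offset] pair taken by IO.read / File.binread.
def length_and_offset(from, to)
  [to - from + 1, from]
end

# Byte ranges copied from the question (URLs omitted)
requests = [
  { from: 0,          to: 20_000_000 },
  { from: 20_000_001, to: 40_000_000 },
  { from: 40_000_001, to: 54_184_279 }
]

pairs = requests.map { |r| length_and_offset(r[:from], r[:to]) }
# pairs => [[20_000_001, 0], [20_000_000, 20_000_001], [14_184_279, 40_000_001]]

total = pairs.sum { |length, _offset| length }
# total => 54_184_280
```

Under that reading the three ranges cover 54_184_280 bytes with no gap or overlap, and a part could be read with File.binread("bigFile", length, offset). Note that this loads the whole part into memory at once.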

Upvotes: 0
