Reputation: 115
I need to upload a big file to a third-party service. The service gives me a list of URLs and byte ranges:
requests = [
{url: "https://.../part1", from: 0, to: 20_000_000},
{url: "https://.../part2", from: 20_000_001, to: 40_000_000},
{url: "https://.../part3", from: 40_000_001, to: 54_184_279}
]
I'm using the httpx gem to upload the data; the :body option can receive an IO or Enumerable object.
I would like to split and upload the chunks efficiently. That is why I think I should avoid writing chunks to disk and also avoid loading the entire file into memory. I suppose the best option would be some kind of "lazy Enumerable", but I don't know how to write the part function that would return this IO or Enumerable object.
def part(file, from, to)
  # ???
end
file = File.open("bigFile", "rb")
# map (not each) so that the result is the array of threads
results = requests.map do |request|
  Thread.start { HTTPX.post(request[:url], body: part(file, request[:from], request[:to])) }
end.map(&:value)
Upvotes: 3
Views: 289
Reputation: 16980
The easiest way to generate an enumerator for each "byterange" would be to let the part function handle the opening of the file:
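def part(filepath, from, to = nil, chunk_size = 4096, &block)
  return to_enum(__method__, filepath, from, to, chunk_size) unless block_given?
  file_size = File.size(filepath)
  # Treat `to` as an inclusive offset; fall back to the last byte if it is missing or out of range.
  to = file_size - 1 unless to && to >= from && to < file_size
  io = File.open(filepath, "rb")
  io.seek(from, IO::SEEK_SET)
  while io.pos <= to
    # Read a full chunk, or only the bytes left up to and including `to`.
    length = (io.pos + chunk_size <= to) ? chunk_size : 1 + to - io.pos
    chunk = io.read(length)
    break if chunk.nil? # the file shrank while we were reading
    yield chunk
  end
ensure
  io.close if io
end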
Warning: to is treated as an inclusive offset, so the last chunk is 1 + to - io.pos bytes long. Double-check the chunk size calculation against the ranges your service expects.
Note: You may want to improve this function so that it always reads a full physical disk block (or a multiple of it), as that can speed up the IO. The reads will be misaligned whenever from is not a multiple of the physical block size.
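The alignment idea in the note can be sketched as follows. This is only an illustration: aligned_chunk_sizes is a hypothetical helper, and the 4096-byte block size is an assumption that you would need to replace with the real block size of your storage. It makes the first read short enough to reach the next block boundary, so every later read starts on a multiple of the block size.

```ruby
# Hypothetical helper: split the inclusive byte range from..to into read
# sizes so that every read after the first starts on a multiple of
# block_size (4096 is an assumed value, not a measured one).
def aligned_chunk_sizes(from, to, block_size = 4096)
  sizes = []
  pos = from
  while pos <= to
    # Bytes up to the next block boundary (a full block if already aligned).
    to_boundary = block_size - (pos % block_size)
    # Never read past the inclusive end of the range.
    size = [to_boundary, to - pos + 1].min
    sizes << size
    pos += size
  end
  sizes
end

p aligned_chunk_sizes(1000, 10_000) # => [3096, 4096, 1809]
```

Whether this helps in practice depends on the operating system's caching and read-ahead, so it is worth measuring before adopting it.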
The part function now returns an Enumerator when called without a block:
part("bigFile", 0, 1300, 512)
#=> #<Enumerator: main:part("bigFile", 0, 1300, 512)>
And of course you can call it directly with a block:
part("bigFile", 0, 1300, 512) do |chunk|
  puts chunk.inspect
end
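If you would rather keep one open File shared between threads, as in the question, note that seek followed by read is not safe there, because all threads share the same file position. A minimal sketch of an alternative, assuming Ruby 2.5+ on a platform where IO#pread is available: pread_part is a hypothetical name, and the 16 KiB default chunk size is arbitrary.

```ruby
# Hypothetical variant of part: takes an already open File and an inclusive
# byte range, and returns a lazy Enumerator of binary chunks.
def pread_part(file, from, to, chunk_size = 16_384)
  Enumerator.new do |yielder|
    pos = from
    while pos <= to
      size = [chunk_size, to - pos + 1].min
      begin
        # IO#pread reads at an absolute offset and leaves file.pos untouched,
        # so several threads can read from the same File object.
        chunk = file.pread(size, pos)
      rescue EOFError
        break # the range extends past the end of the file
      end
      yielder << chunk
      # pread may return fewer bytes than requested, so advance by what we got.
      pos += chunk.bytesize
    end
  end
end
```

Since the question states that httpx's :body option accepts an Enumerable, the result could presumably be passed as body: pread_part(file, request[:from], request[:to]); I have not verified that against the gem.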
Upvotes: 1
Reputation: 80065
IO.read("bigFile", 1000, 2000) will read 1000 bytes, starting at offset 2000. Ruby starts counting at zero, so I think
IO.read("bigFile", 20_000_000, 0) # followed by
IO.read("bigFile", 20_000_000, 20_000_000) # not 20_000_001
would be correct. Without bookkeeping:
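If the service's ranges turn out to be inclusive on both ends, which the from/to values in the question suggest but do not prove, the length to pass is to - from + 1. A small sketch of that conversion, where read_range is a hypothetical helper; unlike an enumerator, it loads the whole part into memory at once.

```ruby
# Hypothetical helper: read the inclusive byte range from..to in one call.
# File.binread takes (path, length, offset) and returns a binary string.
def read_range(path, from, to)
  File.binread(path, to - from + 1, from)
end
```

For the first entry in the question this would be read_range("bigFile", 0, 20_000_000), which reads 20_000_001 bytes.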
f = File.open("bigFile", "rb") # binary mode, so bytes are read unchanged
partname = "part0"
until f.eof? do
  partname = partname.succ
  chunk = f.read(20_000_000)
  # do something with chunk and partname
end
f.close
Upvotes: 0