dekauliya

Reputation: 1433

How to send a large file as chunked requests under Resumable Upload in GCS if a chunk gets interrupted?

Background

I am working on a project that creates an abstraction on top of cloud storage and buckets. However, I have had trouble figuring out how best to support sending large files to GCS. We need the ability to send a large file in chunks, and we want control over the buffer/stream being sent in each chunk.

S3 has multipart upload, which allows us to send a file in chunks in parallel. Unfortunately, GCS does not support this; it has composite objects, which also allow us to upload parts of a file in parallel. However, composite objects come with various limitations: for example, the inability to use customer-supplied encryption, MD5 digests, or retention policies, and having to clean up the temporary component files manually. These are problematic because we want to support those features.

Resumable Upload

From the documentation, the recommended way of sending a large file to GCS is via resumable uploads. Our use case is sending a large file of unknown total size in buffered chunks, given that we know the size of each chunk and whether a chunk is the last part. From my understanding, the ideal flow would be to send the first N-1 chunks with content-range=[offset-(offset+chunkSize)]/* (with a variable chunkSize) and to send the last chunk as content-range=[offset-(offset+remainingSize)]/[TOTAL_SIZE].
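
To make the flow I have in mind concrete, here is a minimal sketch against the JSON API (the bucket/object names are placeholders, the access token is assumed to come from elsewhere, and error handling is omitted):

```python
# Minimal sketch of a chunked resumable upload via the GCS JSON API.
import requests

ACCESS_TOKEN = "..."          # assumed to come from your auth flow
CHUNK_SIZE = 8 * 256 * 1024   # all but the last chunk: multiple of 256 KiB

# 1. Initiate the session; GCS returns the session URI in the Location header.
init = requests.post(
    "https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o"
    "?uploadType=resumable&name=big-file",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
session_uri = init.headers["Location"]

# 2. Send chunks; the total size stays "*" until we reach the last chunk.
offset = 0
with open("big-file", "rb") as f:
    while True:
        data = f.read(CHUNK_SIZE)
        if not data:
            # Stream ended exactly on a chunk boundary: finalize with a
            # zero-byte request that declares the total size.
            requests.put(session_uri,
                         headers={"Content-Range": f"bytes */{offset}"})
            break
        last = len(data) < CHUNK_SIZE
        total = offset + len(data) if last else "*"
        resp = requests.put(
            session_uri,
            data=data,
            headers={"Content-Range":
                     f"bytes {offset}-{offset + len(data) - 1}/{total}"},
        )
        # 308 = chunk persisted, more expected; 200/201 = object created.
        offset += len(data)
        if last:
            break
```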

My question is, what if a chunk upload gets interrupted?

What does it mean to resume an interrupted upload of a chunk? Do we send just the remaining bytes of the current chunk (content-range=[lastByte-(chunkSize-lastByte)]/*), or do we send them together with the next chunk (content-range=[lastByte-(chunkSize-lastByte)+chunkSize]/*)?

Also, there is a limitation for resumable uploads: each chunk must be a multiple of 256 KiB. Does that mean an interruption makes it impossible to keep the chunks in sync? In other words, instead of expecting a regular content-range for the chunks (content-range=[offset-(offset+chunkSize)]/*), will an interruption in a resumable upload cause the remaining chunks to be sized dynamically until the last chunk?

Thank you so much.

Upvotes: 1

Views: 2596

Answers (2)

Chris Madden

Reputation: 2650

In the meantime, GCS has added support for the Multipart Upload (MPU) API to its XML API. I think using it, instead of the composite-objects-plus-resumable-uploads-for-the-chunks pattern, will give you a better experience.

With MPU you initiate an upload, upload chunk (aka part) files to a special hidden area of the bucket, and once done you finalize them into an object.
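
To make that flow concrete, here is a rough sketch using boto3 pointed at GCS's S3-interoperable XML endpoint with HMAC credentials; the bucket, key, credentials, and part size are all placeholders:

```python
# Rough sketch of the MPU flow against GCS's S3-interoperable XML API.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id="GOOG...",    # a GCS HMAC key id (placeholder)
    aws_secret_access_key="...",    # the matching HMAC secret
)

# 1. Initiate: parts accumulate in a hidden area until finalized.
mpu = s3.create_multipart_upload(Bucket="my-bucket", Key="big-file")
upload_id = mpu["UploadId"]

# 2. Upload parts (these could run in parallel); keep the returned ETags.
parts = []
with open("big-file", "rb") as f:
    for n, data in enumerate(iter(lambda: f.read(32 * 1024 * 1024), b""), 1):
        part = s3.upload_part(Bucket="my-bucket", Key="big-file",
                              UploadId=upload_id, PartNumber=n, Body=data)
        parts.append({"PartNumber": n, "ETag": part["ETag"]})

# 3. Finalize the parts into a single object.
s3.complete_multipart_upload(
    Bucket="my-bucket", Key="big-file", UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```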

Individual part file uploads don't support resumable uploads, but if one fails you can simply upload it again, so choose a part size whose retransmission cost is acceptable. An initiated multipart upload stays active until it is finalized or aborted. You can create an Object Lifecycle Management action, AbortIncompleteMultipartUpload, to clean up uploads that become stale.
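
As a sketch, with the Python client you can attach such a rule by patching the bucket's lifecycle configuration; the 7-day age below is an arbitrary choice and the bucket name is a placeholder:

```python
# Sketch: abort multipart uploads that have been stale for 7 days.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket")   # placeholder bucket name
bucket.lifecycle_rules = list(bucket.lifecycle_rules) + [
    {"action": {"type": "AbortIncompleteMultipartUpload"},
     "condition": {"age": 7}},
]
bucket.patch()
```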

The GCS client libraries for Python and Node.js include a transfer_manager module that takes care of the details of uploading and downloading in parallel; you just give it a chunk size and a worker count and off it goes! For other languages, either use an S3-compatible library or call the XML API directly.
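
In Python that can be as short as the following sketch (assuming a google-cloud-storage version recent enough to ship transfer_manager; bucket, object, and file names are placeholders):

```python
# Sketch: parallel MPU-based upload via the Python client's transfer manager.
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
blob = client.bucket("my-bucket").blob("big-file")   # placeholder names

# Splits the local file into 32 MiB parts and uploads them on 8 workers.
transfer_manager.upload_chunks_concurrently(
    "local/big-file", blob,
    chunk_size=32 * 1024 * 1024,
    max_workers=8,
)
```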

For more on the pros and cons, check my recent blog post: High throughput file transfers with Google Cloud Storage (GCS).

Upvotes: 0

coryan

Reputation: 826

What does it mean to resume an interrupted upload of a chunk? Do we send just the remaining bytes of the current chunk (content-range=[lastByte-(chunkSize-lastByte)]/*), or do we send them together with the next chunk (content-range=[lastByte-(chunkSize-lastByte)+chunkSize]/*)?

It depends. There is no requirement to make all the chunk sizes the same, nor to decide on them at the beginning of the upload, nor to remember what chunks you sent. As you note below, there is a requirement that all but the last chunk be sent in sizes that are multiples of 256 KiB.

To answer your question: if (chunkSize - lastByte) is a multiple of 256 KiB, you could send that as a new chunk; otherwise you may need to send the bytes from lastByte to lastByte + N * 256 KiB.
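
As a sketch of what that looks like in code, assuming a requests-based loop and a session URI kept from the initial POST (this is illustrative, not the client libraries' implementation):

```python
# Sketch: after an interruption, ask GCS what it has committed, then
# restart from that offset with a freshly aligned chunk.
import requests

KIB_256 = 256 * 1024
session_uri = "..."   # the session URI returned when the upload was initiated

def committed_offset(session_uri):
    """Return how many bytes GCS has persisted for this session."""
    resp = requests.put(session_uri, headers={"Content-Range": "bytes */*"})
    if resp.status_code in (200, 201):
        return None                      # the upload already completed
    rng = resp.headers.get("Range")      # e.g. "bytes=0-786431"
    return int(rng.rsplit("-", 1)[1]) + 1 if rng else 0

# The committed offset is the source of truth: the next chunk starts there,
# and (unless it is the final chunk) its size must again be a multiple of
# 256 KiB, so the original chunk boundaries may shift after a resume.
offset = committed_offset(session_uri)
if offset is not None:
    next_chunk_size = 4 * KIB_256        # any multiple of 256 KiB works
```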

Also, there is a limitation for resumable uploads: each chunk must be a multiple of 256 KiB. Does that mean an interruption makes it impossible to keep the chunks in sync?

No. What it means is that as you resume the upload the chunk boundaries may need to change.

In practice I think GCS always commits at 256 KiB boundaries, but I do not believe there is any guarantee that it will always do so.

Upvotes: 2
