Dean Hiller

Reputation: 20190

Upload to GCP Storage signed url chunk by chunk for file

I was reading this post -> upload to google cloud storage signed url with javascript

and it reads the entire file into the reader, then seems to send the entire file. Is there a way instead to read a chunk and send a chunk with GCP Storage signed URLs? That way we do not blow out memory on a very large file, and we can also show a progress bar as we upload.

We are fine with any JavaScript client, as we do not currently use one.

thanks, Dean

Upvotes: 0

Views: 3221

Answers (2)

Markus Mobius

Reputation: 11

We are doing chunked uploads with composing - so we chunk the file, create a signed URL for every chunk, and then compose the chunks into the final object.

Here is a fully working C# example for chunked upload and download of a test file to a Google Cloud Storage bucket (it took me a long time to put my original solution together because I didn't find much online). To compile, you need to install these packages from NuGet:

https://www.nuget.org/packages/MimeTypes

https://www.nuget.org/packages/Crc32.NET/1.2.0/

You also need to install the Google Cloud Storage client library https://www.nuget.org/packages/Google.Cloud.Storage.V1/
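
For example, with the dotnet CLI (package IDs taken from the links above):

dotnet add package MimeTypes
dotnet add package Crc32.NET --version 1.2.0
dotnet add package Google.Cloud.Storage.V1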

Finally, it is assumed that you have a JSON credentials file downloaded from the Google Cloud console (here it is called credentials.json).

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading; //required for the Thread class used below
using System.Threading.Tasks;
using Google.Cloud.Storage.V1;
using Google.Apis.Storage.v1.Data;
using System.Net.Http;
using System.Net.Http.Headers;
using System.IO;
using System.Xml;
using System.Web;
using Google.Apis.Auth.OAuth2;
using System.Security.Cryptography;
using Force.Crc32;

namespace GoogleCloudPOC
{
    class Program
    {

        static StorageClient storage;
        static UrlSigner urlSigner;
        static string bucketName = "ratiodata";
        
        static void Main(string[] args)
        {
            var credential = GoogleCredential.FromFile("credentials.json");
            storage = StorageClient.Create(credential);
            urlSigner = UrlSigner.FromServiceAccountPath("credentials.json");

            //create a dummy file
            var arr = new byte[1000000];
            var r = new Random();
            for(int i = 0; i < arr.Length; i++)
            {
                arr[i] = (byte) r.Next(256); //upper bound is exclusive, so this covers the full byte range
            }

            //now upload this file in two chunks - we use two threads to illustrate that it is done in parallel
            Console.WriteLine("Starting parallel upload ...");
            string cloudFileName = "parallel_upload_test.dat";
            var threadpool = new Thread[2];
            int offset = 0;
            int buflength = 100000;
            int blockNumber = 0;
            var blockList = new SortedDictionary<int, string>();
            for(int t = 0; t < threadpool.Length; t++)
            {
                threadpool[t] = new Thread(delegate ()
                {
                    while (true)
                    {
                        int currentOffset = -1;
                        int currentBlocknumber = -1;
                        lock (arr)
                        {
                            if (offset >= arr.Length) { break; }
                            currentOffset = offset;
                            currentBlocknumber = blockNumber;
                            offset += buflength;
                            blockNumber++;
                        }
                        int len = buflength;
                        if (currentOffset + len > arr.Length)
                        {
                            len = arr.Length - currentOffset;
                        }
                        //create signed url
                        var dict = new Dictionary<string, string>();
                        //calculate hash
                        var crcHash = Crc32CAlgorithm.Compute(arr, currentOffset, len);
                        var b = BitConverter.GetBytes(crcHash);
                        if (BitConverter.IsLittleEndian)
                        {
                            Array.Reverse(b);
                        }
                        string blockID = $"__TEMP__/{cloudFileName.Replace('/', '*')}.part_{currentBlocknumber}_{Convert.ToBase64String(b)}";
                        lock (blockList)
                        {
                            blockList.Add(currentBlocknumber, blockID);
                        }
                        dict.Add("x-goog-hash", $"crc32c={Convert.ToBase64String(b)}");
                        //add custom time
                        var dt = DateTimeOffset.UtcNow.AddHours(-23); //with a lifecycle rule that deletes 1 day after custom time, the temp chunks become eligible for deletion about an hour after creation
                        var CustomTime = String.Format("{0:D4}-{1:D2}-{2:D2}T{3:D2}:{4:D2}:{5:D2}.{6:D2}Z", dt.Year, dt.Month, dt.Day, dt.Hour, dt.Minute, dt.Second, dt.Millisecond / 10);
                        dict.Add("x-goog-custom-time", CustomTime);
                        var signedUrl = getSignedUrl(blockID, 1, "upload", dict);
                        //now perform the actual upload with this URL - this part could run in the browser as well
                        using (var client = new HttpClient())
                        {
                            var content = new ByteArrayContent(arr, currentOffset, len);
                            content.Headers.ContentType = MediaTypeHeaderValue.Parse("application/octet-stream");
                            foreach (var kvp in dict)
                            {
                                client.DefaultRequestHeaders.Add(kvp.Key, kvp.Value);
                            }
                            var response = client.PutAsync(signedUrl, content).Result;
                            if (!response.IsSuccessStatusCode)
                            {
                                throw new Exception("upload failed"); //this should be replaced with some sort of exponential backoff
                            }
                        }

                    }
                });
                threadpool[t].Start();
            }
            for (int t = 0; t < threadpool.Length; t++)
            {
                threadpool[t].Join();
            }
            //now we compose the chunks into a single file - we can do at most 32 at a time
            BlobCombine(blockList.Values.ToArray(), cloudFileName);
            Console.WriteLine("... parallel upload finished");

            //now use chunked download
            Console.WriteLine("Starting parallel download ...");
            var downloadedArr = new byte[arr.Length];
            threadpool = new Thread[2];
            offset = 0;
            buflength = 200000;
            var downloadUrl = getSignedUrl(cloudFileName, 1, "download"); //single download URL is sufficient
            for (int t = 0; t < threadpool.Length; t++)
            {
                threadpool[t] = new Thread(delegate ()
                {
                    while (true)
                    {
                        int currentOffset = -1;
                        lock (downloadedArr)
                        {
                            if (offset >= arr.Length) { break; }
                            currentOffset = offset;
                            offset += buflength;
                        }
                        int len = buflength;
                        if (currentOffset + len > downloadedArr.Length)
                        {
                            len = downloadedArr.Length - currentOffset;
                        }

                        //now perform the actual download with this URL - this part could run in the browser as well
                        var tags = new Dictionary<string, string>();
                        tags.Add("Range", $"bytes={currentOffset}-{currentOffset + len - 1}");
                        using (var client = new HttpClient())
                        {
                            var request = new HttpRequestMessage { RequestUri = new Uri(downloadUrl) };
                            foreach (var kvp in tags)
                            {
                                client.DefaultRequestHeaders.Add(kvp.Key, kvp.Value);
                            }
                            var response = client.SendAsync(request).Result;
                            var buffer = new byte[len];
                            //Stream.Read may return fewer bytes than requested, so loop until the whole chunk has arrived
                            using (var stream = response.Content.ReadAsStream())
                            {
                                int read = 0;
                                while (read < len)
                                {
                                    int n = stream.Read(buffer, read, len - read);
                                    if (n == 0) { throw new Exception("unexpected end of stream"); }
                                    read += n;
                                }
                            }
                            //only lock while touching the shared array
                            lock (downloadedArr)
                            {
                                Array.Copy(buffer, 0, downloadedArr, currentOffset, len);
                            }
                        }
                    }
                });
                threadpool[t].Start();
            }
            for (int t = 0; t < threadpool.Length; t++)
            {
                threadpool[t].Join();
            }
            Console.WriteLine("... parallel download finished");

            //compare original array and downloaded array
            for(int i = 0; i < arr.Length; i++)
            {
                if (arr[i] != downloadedArr[i])
                {
                    throw new Exception("download is different from original data");
                }
            }
            Console.WriteLine("good job: original and downloaded data are the same!");
        }



        static string getSignedUrl(string cloudFileName, int hours, string capability, Dictionary<string, string> tags = null)
        {
            string url = null;
            switch (capability)
            {
                case "download":
                    url = urlSigner.Sign(bucketName, cloudFileName, TimeSpan.FromHours(hours), HttpMethod.Get);
                    break;
                case "upload":
                    var requestHeaders = new Dictionary<string, IEnumerable<string>>();
                    if (tags != null)
                    {
                        foreach (var kvp in tags)
                        {
                            requestHeaders.Add(kvp.Key, new[] { kvp.Value });
                        }
                    }
                    UrlSigner.Options options = UrlSigner.Options.FromDuration(TimeSpan.FromHours(hours));
                    UrlSigner.RequestTemplate template = UrlSigner.RequestTemplate
                        .FromBucket(bucketName)
                        .WithObjectName(cloudFileName).WithHttpMethod(HttpMethod.Put);
                    if (requestHeaders.Count > 0)
                    {
                        template = template.WithRequestHeaders(requestHeaders);
                    }
                    url = urlSigner.Sign(template, options);
                    break;
                case "delete":
                    url = urlSigner.Sign(bucketName, cloudFileName, TimeSpan.FromHours(hours), HttpMethod.Delete);
                    break;
            }
            return url;
        }

        static bool BlobCombine(string[] inputFiles, string outputFile)
        {
            var sourceObjects = new List<ComposeRequest.SourceObjectsData>();
            foreach (var fn in inputFiles)
            {
                sourceObjects.Add(new ComposeRequest.SourceObjectsData { Name = fn });
            }
            while (sourceObjects.Count > 32)
            {
                var prefix = sourceObjects.First().Name.Split('.').First();
                var newSourceObjects = new List<ComposeRequest.SourceObjectsData>();
                var currentSplit = new List<ComposeRequest.SourceObjectsData>();
                var sb = new StringBuilder();
                for (int i = 0; i < sourceObjects.Count; i++)
                {
                    sb.Append(sourceObjects[i].Name.Split('.').Last());
                    currentSplit.Add(sourceObjects[i]);
                    if (currentSplit.Count == 32)
                    {
                        var targetName = $"{prefix}.{HashStringOne(sb.ToString())}";
                        if (!condense(currentSplit, targetName, false))
                        {
                            return false;
                        }
                        newSourceObjects.Add(new ComposeRequest.SourceObjectsData() { Name = targetName });
                        currentSplit = new List<ComposeRequest.SourceObjectsData>();
                        sb = new StringBuilder();
                    }
                }
                if (currentSplit.Count == 1)
                {
                    newSourceObjects.Add(currentSplit[0]);
                }
                if (currentSplit.Count > 1)
                {
                    var targetName = $"{prefix}.{HashStringOne(sb.ToString())}";
                    if (!condense(currentSplit, targetName, false))
                    {
                        return false;
                    }
                    newSourceObjects.Add(new ComposeRequest.SourceObjectsData() { Name = targetName });
                }
                sourceObjects = newSourceObjects;
            }
            return condense(sourceObjects, outputFile, true);
        }

        static ulong HashStringOne(string s)
        {
            ulong hash = 0;

            for (int i = 0; i < s.Length; i++)
            {
                hash += (ulong)s[i];
                hash += (hash << 10);
                hash ^= (hash >> 6);
            }

            hash += (hash << 3);
            hash ^= (hash >> 11);
            hash += (hash << 15);
            return hash;
        }


        static bool condense(List<ComposeRequest.SourceObjectsData> input, string targetName, bool lastRound)
        {
            try
            {
                storage.Service.Objects.Compose(new ComposeRequest
                {
                    SourceObjects = input                   
                }, bucketName, targetName).Execute();
                if (!lastRound)
                {
                    //set custom time
                    var file = storage.GetObject(bucketName, targetName);
                    file.CustomTime = DateTime.UtcNow.AddHours(-23);
                    file = storage.UpdateObject(file);
                }
                else
                {
                    //try to set mime type based on file extensions
                    var file = storage.GetObject(bucketName, targetName);
                    file.ContentType = MimeTypes.GetMimeType(targetName);
                    file = storage.UpdateObject(file);
                }
                return true;
            }
            catch (Exception) //compose failed - report failure to the caller
            {
                return false;
            }
        }

    }
}

The upload is performed in parallel using signed URLs. Even though this is a C# command-line program, you could easily put that code into an ASP.NET Core backend. There are a few lines of code where the actual upload/download happens using HttpClient - those could be done in JavaScript in the browser.

The only thing that has to run on the backend is creating the signed URLs - plus the composing of the chunks (this could probably be driven from the browser, but it typically isn't a heavy operation, and Google recommends performing these operations without signed URLs).
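
As a rough sketch (untested, and separate from the console program above), the signed-URL piece could be a minimal ASP.NET Core endpoint in a net6.0+ web project. The route and parameter names here are hypothetical, and it assumes the getSignedUrl helper (with its urlSigner and bucketName fields) from the example is accessible:

//hypothetical endpoint: the browser computes the CRC32C of a chunk, base64-encodes it,
//and asks the backend for a matching signed upload URL
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();
app.MapGet("/signedUploadUrl", (string objectName, int block, string crc32c) =>
{
    //the hash header must be covered by the signature, so it has to be supplied up front
    var headers = new Dictionary<string, string>
    {
        { "x-goog-hash", $"crc32c={crc32c}" }
    };
    //mirror the temp-chunk naming scheme used in the example above
    var blockID = $"__TEMP__/{objectName.Replace('/', '*')}.part_{block}_{crc32c}";
    return getSignedUrl(blockID, 1, "upload", headers);
});
app.Run();

(The client should URL-encode the base64 hash when putting it in the query string, and you would add the x-goog-custom-time header here in the same way.)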

Note that you have to create a different signed URL for each upload chunk, but a single signed URL is sufficient for the download.

Also note that the composition code is a bit involved because you can only combine up to 32 chunks into a new object on cloud storage - hence you might need a few rounds of composition (you can compose objects that are already composed).

I am including CRC32C hashes in the upload to make sure the data is uploaded correctly. There should be some JavaScript library to compute this in the browser. If you run this in the browser, you need to send the hash to the backend when requesting a signed upload URL, because this header is part of the PUT request and has to be covered by the signature of the signed URL.

The custom time is set to 23 hours before the current time so that you can set a lifecycle rule on your bucket which deletes the temporary chunks one day after their custom time (effectively the deletion happens a few hours later, even though the chunks become eligible one hour after creation). You can also delete the chunks manually, but I would use the custom-time approach anyway to make sure you are not gunking up your bucket with failed uploads.
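
For reference, the matching lifecycle rule is a small JSON document; assuming you apply it with gsutil lifecycle set on the bucket, it would look like this:

{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"daysSinceCustomTime": 1}
    }
  ]
}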

The above approach is truly parallel upload/download. If you just care about chunking (for a progress bar, say) but don't need parallel threads doing the upload/download, then a resumable upload is also possible (you would still use the same download approach as outlined above). Such an upload is initiated with a single POST call, after which you can upload the file chunk by chunk (similar to the way the download code works).
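
For completeness, here is a minimal sketch of that resumable variant (untested; it reuses the urlSigner and bucketName fields from the example above - the x-goog-resumable header, the 256 KiB chunk granularity and the 308 status code come from the Cloud Storage XML API):

        //minimal sketch of a resumable upload: one signed POST initiates the session,
        //then plain PUTs with Content-Range upload the chunks (no further signing needed)
        static void ResumableUpload(string cloudFileName, byte[] data, int chunkSize)
        {
            //chunkSize must be a multiple of 256 KiB, except for the final chunk
            var template = UrlSigner.RequestTemplate
                .FromBucket(bucketName)
                .WithObjectName(cloudFileName)
                .WithHttpMethod(HttpMethod.Post)
                .WithRequestHeaders(new Dictionary<string, IEnumerable<string>>
                {
                    { "x-goog-resumable", new[] { "start" } }
                });
            var signedUrl = urlSigner.Sign(template, UrlSigner.Options.FromDuration(TimeSpan.FromHours(1)));
            using (var client = new HttpClient())
            {
                //initiate the session - the Location header of the response is the session URI
                var init = new HttpRequestMessage(HttpMethod.Post, signedUrl);
                init.Headers.Add("x-goog-resumable", "start");
                init.Content = new ByteArrayContent(Array.Empty<byte>());
                var initResponse = client.SendAsync(init).Result;
                if (!initResponse.IsSuccessStatusCode) { throw new Exception("could not start resumable upload"); }
                var sessionUri = initResponse.Headers.Location;
                //upload chunk by chunk; intermediate chunks return 308, the final one 200/201
                for (int offset = 0; offset < data.Length; offset += chunkSize)
                {
                    int len = Math.Min(chunkSize, data.Length - offset);
                    var put = new HttpRequestMessage(HttpMethod.Put, sessionUri)
                    {
                        Content = new ByteArrayContent(data, offset, len)
                    };
                    put.Content.Headers.ContentRange = new ContentRangeHeaderValue(offset, offset + len - 1, data.Length);
                    var response = client.SendAsync(put).Result;
                    if ((int)response.StatusCode != 308 && !response.IsSuccessStatusCode)
                    {
                        throw new Exception("chunk upload failed");
                    }
                }
            }
        }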

Upvotes: 0

RJC

Reputation: 1338

Resumable uploads work by sending multiple requests, each of which contains a portion of the object you're uploading.

When working with resumable uploads, you only create and use a signed URL for the POST request that initiates the upload. This initial request returns a session URI that you use in subsequent PUT requests to upload the data. Since the session URI acts as an authentication token, the PUT requests do not use any signed URLs.

Once you've initiated a resumable upload, there are two ways to upload the object's data:

  1. In a single chunk: This approach is usually best, since it requires fewer requests and thus has better performance.
  2. In multiple chunks: Use this approach if you need to reduce the amount of data transferred in any single request, such as when there is a fixed time limit for individual requests, or if you don't know the total size of the upload at the time the upload begins.

You can use the Cloud Storage Node.js library. Do note that when using a signed URL to start a resumable upload session, you need to specify the x-goog-resumable header with the value start in the request, or else signature validation will fail. Refer to this documentation for additional samples, and to the guides for getting a signed URL to allow limited-time access to a bucket.
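
For illustration, here is just the initiation step, sketched in C# with the same UrlSigner API the other answer uses (bucket and object names are placeholders); the key point is that x-goog-resumable: start appears both in the signed request template and in the actual POST:

//sign the initiating POST with x-goog-resumable:start - omitting the header from either
//the template or the actual request makes signature validation fail
var template = UrlSigner.RequestTemplate
    .FromBucket("my-bucket")
    .WithObjectName("my-file.dat")
    .WithHttpMethod(HttpMethod.Post)
    .WithRequestHeaders(new Dictionary<string, IEnumerable<string>>
    {
        { "x-goog-resumable", new[] { "start" } }
    });
var signedUrl = urlSigner.Sign(template, UrlSigner.Options.FromDuration(TimeSpan.FromHours(1)));
var request = new HttpRequestMessage(HttpMethod.Post, signedUrl);
request.Headers.Add("x-goog-resumable", "start");
request.Content = new ByteArrayContent(Array.Empty<byte>());
var sessionUri = new HttpClient().SendAsync(request).Result.Headers.Location; //use this URI for the subsequent PUTs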

Upvotes: 1
