Reputation:
Here is how I normally download a GCS file to a local file:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('mybucket')
blob = bucket.blob('myBigFile.txt')
blob.download_to_filename('myBigFile.txt')
The files that I am working with are much, much larger than the memory/disk available to a Cloud Function (for example, several GBs to several TBs), so the above approach does not work for these large files.
Is there a simpler, "streaming" (see example 1 below) or "direct-access" (see example 2 below) way to work with GCS files in a Cloud Function?
Two examples of what I'd be looking to do would be:
# 1. Load it in chunks of 5GB -- "Streaming"
storage_client = storage.Client()
bucket = storage_client.get_bucket('mybucket')
blob = bucket.blob('myBigFile.txt')
while True:
    data = blob.download_to_filename('myBigFile.txt', chunk_size=5 * 1024**3)  # 5 GB
    do_something(data)
    if not data:
        break
Or:
# 2. Read the data from GCS without downloading it locally -- "Direct Access"
storage_client = storage.Client()
bucket = storage_client.get_bucket('mybucket')
blob = bucket.blob('myBigFile.txt')
with blob.read_filename('myBigFile.txt') as f:
    do_something(f)
I'm not sure whether either of these is possible, but I'm listing a few options for how this could work. It seems like the streaming option is supported, but I wasn't sure how to apply it to the case above.
Upvotes: 7
Views: 4408
Reputation: 1544
As of this writing, the standard Google Cloud client library does not support streaming uploads or downloads.
Have a look at GCSFS. Caveat: you may need to implement a retry strategy in case the connection gets lost.
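A minimal sketch of what reading a large object through GCSFS might look like (the bucket and file names come from the question; the project ID, the chunk size, and do_something are assumptions/placeholders):
import gcsfs

# Assumed project ID; in a Cloud Function the default credentials are picked up automatically.
fs = gcsfs.GCSFileSystem(project='my-project')

# Open the object as a file-like stream and process it in manageable chunks
# instead of loading the whole file into memory.
with fs.open('mybucket/myBigFile.txt', 'rb') as f:
    while True:
        chunk = f.read(64 * 1024 * 1024)  # read 64 MB at a time
        if not chunk:
            break
        do_something(chunk)  # placeholder from the question
Because the data is read lazily over HTTP, only one chunk needs to fit in memory at a time.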
Upvotes: 1
Reputation: 39814
You might be able to achieve something close to your #1 example using the Cloud Storage XML API.
There should not be a problem implementing it inside Cloud Functions since it's entirely based on standard HTTP requests.
You're probably looking for the GET Object request to Download an Object:
GET requests for objects can include a Range header as defined in the HTTP 1.1 RFC to limit the scope of the returned data within the object, but be aware that in certain circumstances the range header is ignored.
That HTTP Range header appears to be usable to implement the "chunks" you're looking for (but as standalone requests, not in a "streaming" mode):
The range of bytes that you want returned in the response, or the range of bytes that have been uploaded to the Cloud Storage system.
Valid Values
Any contiguous range of bytes.
Example
Range: bytes=0-1999
(first 2000 bytes)
Range: bytes=-2000
(last 2000 bytes)
Range: bytes=2000-
(from byte 2000 to end of file)
Implementation Details
Cloud Storage does not handle complex disjoint ranges, but it does support simple contiguous byte ranges. Also, byte ranges are inclusive; that is, bytes=0-999 represent the first 1000 bytes in a file or object. A valid and successful request will result in a 206 Partial Content response code. For more information, see the specification.
Since the ranges would be static, it's unlikely you'll find range values that line up exactly with the boundaries of the records stored in the file. So you may need to make the chunks overlap a bit, to capture data that would otherwise be split across two chunks.
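A rough sketch of what such chunked, Range-based downloads could look like from Python (the bucket/object names come from the question; the chunk size and do_something are assumptions/placeholders):
import google.auth
import google.auth.transport.requests
import requests

BUCKET = 'mybucket'
OBJECT = 'myBigFile.txt'
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per request; tune to what fits in memory

# Use the credentials available in the Cloud Function's environment.
credentials, _ = google.auth.default()
credentials.refresh(google.auth.transport.requests.Request())

url = 'https://storage.googleapis.com/{}/{}'.format(BUCKET, OBJECT)
start = 0
while True:
    headers = {
        'Authorization': 'Bearer {}'.format(credentials.token),
        'Range': 'bytes={}-{}'.format(start, start + CHUNK_SIZE - 1),
    }
    resp = requests.get(url, headers=headers)
    if resp.status_code == 416:  # requested range is past the end of the object
        break
    resp.raise_for_status()
    do_something(resp.content)          # process this chunk
    if len(resp.content) < CHUNK_SIZE:  # short read means this was the last chunk
        break
    start += CHUNK_SIZE
Each chunk is an independent HTTP request, so a failed request can simply be retried without restarting the whole download.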
Note: I didn't try this; the answer is based solely on the docs.
Upvotes: 1