Reputation:
I have a Flask API that uses Flask-RESTful, Flask-CORS and Marshmallow. The API does some work to get a *.csv file into Cloud Storage (using signed URLs), confirms that it has uploaded, and then creates and executes a load job to transfer the CSV from Storage to BigQuery. The part of the API that is exacerbating my hair loss is the call that executes that load job. Here is a snippet of the code:
...
dataset_ref = bq_client.dataset(target_dataset)
job_config = bigquery.LoadJobConfig()  # elided above
job_config.schema = bq_schema
job_config.source_format = SOURCE_FORMAT
job_config.field_delimiter = DELIM
job_config.destination_table_description = TARGET_TABLE
job_config.encoding = ENCODING
job_config.max_bad_records = MAX_BAD_RECORDS
job_config.autodetect = False  # Do not autodetect schema
load_job = bq_client.load_table_from_uri(
    uri, dataset_ref.table(target_table), job_config=job_config
)  # API request
load_job.result() # **<-- This is the concern**
return {"message": "Successfully uploaded to Bigquery"}, 200
The file can take some time to transfer, and my concern is that during periods of latency the web server will time out whilst waiting for the transfer to complete. I would much prefer to kick off the load job without blocking on load_job.result(), grab the job ID and return a 201 response straight away. Then I can use the job ID to poll GCP to determine whether the job succeeded, rather than risk the request timing out on the client-side front-end and leaving the user confused as to whether it succeeded or not.
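Roughly, this is the shape I have in mind (a sketch only; the /load and /load-status routes are illustrative names, and get_job() is the client method that looks a job up by its ID):

from flask import Flask
from google.cloud import bigquery

app = Flask(__name__)
bq_client = bigquery.Client()

@app.route("/load", methods=["POST"])
def start_load():
    # ... build uri, dataset_ref and job_config exactly as in the snippet above ...
    load_job = bq_client.load_table_from_uri(
        uri, dataset_ref.table(target_table), job_config=job_config
    )
    # Return immediately with the job ID instead of blocking on .result()
    return {"message": "Load job started", "job_id": load_job.job_id}, 201

@app.route("/load-status/<job_id>")
def load_status(job_id):
    job = bq_client.get_job(job_id)  # one lightweight API call, no blocking
    if job.state != "DONE":
        return {"status": job.state}, 200  # "PENDING" or "RUNNING"
    if job.error_result:
        return {"status": "FAILED", "error": job.error_result["message"]}, 200
    return {"status": "SUCCEEDED"}, 200

That way the front-end gets the job ID immediately and can poll /load-status until it sees SUCCEEDED or FAILED.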
I understand that the load job itself runs asynchronously on GCP's side, but load_job.result() blocks until it finishes, and with Flask that doesn't help. I was going to change over to Quart to use async/await, but my other dependencies are not supported, so I would have a lot of refactoring to do. Is there another way that anyone has used to approach this type of problem? Cheers
Upvotes: 1
Views: 759
Reputation: 75715
Quart solves nothing here. Quart still needs a running environment: it waits on the blocking function, oversees it, and calls your callback at the end. Your function must still be running the whole time to perform this.
There is a better design for this. I recommend you have a look at Cloud Tasks. The process is the following:
Set up your Cloud Tasks queue with a retry policy that does not retry immediately (for example, set the min-backoff to 30s).
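For illustration, a minimal sketch of that step, assuming a queue created with gcloud tasks queues create bq-poll --min-backoff=30s, and a hypothetical /check-job/<job_id> endpoint on your API that looks the job up with bq_client.get_job(job_id) and returns a non-2xx status while it is still running:

from google.cloud import tasks_v2

def enqueue_status_check(project, location, queue, api_base_url, job_id):
    # Enqueue an HTTP task that calls back into our own API. While the
    # handler returns a non-2xx status (job still PENDING/RUNNING), Cloud
    # Tasks re-delivers the task, but no sooner than the queue's min-backoff.
    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path(project, location, queue)
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.GET,
            "url": f"{api_base_url}/check-job/{job_id}",
        }
    }
    return client.create_task(parent=parent, task=task)

Once the handler sees the job in the DONE state it returns 200 and the task stops retrying; until then each attempt is spaced at least 30s apart, so no request ever blocks on the load job.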
Upvotes: 1