user8284384

Is there a way to execute non-blocking load_job from BigQuery Python Client library?

I have a Flask API that uses Flask_restful, Flask_CORS and Marshmallow. The API does some work to get a *.csv file into Cloud Storage (using signed URLs), confirms that it has uploaded, and then creates and executes a load job to transfer the csv from Storage to BigQuery. The part of the API that is exacerbating my hair loss is the call to execute a load job in GCP that loads a csv file into BigQuery. Here is a snippet of the code:

...
            dataset_ref = bq_client.dataset(target_dataset)
            job_config.schema = bq_schema
            job_config.source_format = SOURCE_FORMAT
            job_config.field_delimiter = DELIM
            job_config.destination_table_description = TARGET_TABLE
            job_config.encoding = ENCODING
            job_config.max_bad_records = MAX_BAD_RECORDS
            job_config.autodetect = False  # Do not autodetect schema
            load_job = bq_client.load_table_from_uri(
                uri, dataset_ref.table(target_table), job_config=job_config
            )  # API request
            load_job.result()  # <-- This is the concern
            return {"message": "Successfully uploaded to Bigquery"}, 200

The file can take some time to transfer, and my concern is that during periods of latency the webserver will time out while waiting for the transfer to complete. I would much prefer to have the load job start, get the job ID, and return a 201 response. Then I can use the job ID to poll GCP to determine whether it was successful or not, rather than risk the request timing out for the client-side front-end and leaving the user confused as to whether it succeeded or not.

I understand that load_job.result() is async, but with Flask that doesn't help. I was going to change over to Quart to use async/await but my other dependencies are not supported and therefore I will have a lot of refactoring to do. Is there another way that anyone has used to approach this type of problem? Cheers

Upvotes: 1

Views: 759

Answers (1)

guillaume blaquiere

Reputation: 75715

Quart solves nothing. Quart still needs a running environment: it waits on and oversees the blocking function, then calls your callback at the end. Your function must still be running for this to happen.

There is a better design for this. I recommend having a look at Cloud Tasks. The process is as follows:

  • Run your load job
  • Create a task with the load job ID as a parameter
  • Exit the function
  • The task will trigger another function that checks whether the job is over
    • If it has not finished yet, return an error code (anything other than 2XX).
    • If it has finished, return an OK code (2XX).
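
As a rough illustration of the first and last steps, here is a minimal sketch in Python. It assumes `bq_client` is a `google.cloud.bigquery.Client` and relies on `load_table_from_uri()` returning as soon as the job is submitted (only `load_job.result()` blocks). The function names `start_load_job` and `check_load_job`, the 202 and 503 status codes, and the choice to return 2XX for a job that finished with an error are all assumptions made for the example, not part of the original design. Creating the Cloud Tasks task itself is left out.

```python
def start_load_job(bq_client, uri, table_ref, job_config):
    """Start the load job and return without waiting for it to finish."""
    # load_table_from_uri() submits the job and returns a LoadJob right away.
    # load_job.result() is deliberately NOT called, since that is what blocks.
    load_job = bq_client.load_table_from_uri(uri, table_ref, job_config=job_config)
    # Hypothetical response shape: the ID (and location) needed to look the job up later.
    return {"job_id": load_job.job_id, "location": load_job.location}, 202


def check_load_job(bq_client, job_id, location=None):
    """Body of the function triggered by the task: map job state to an HTTP status."""
    job = bq_client.get_job(job_id, location=location)
    if job.state != "DONE":
        # Any non-2XX response makes Cloud Tasks retry the task later.
        return {"job_id": job_id, "state": job.state}, 503
    # "DONE" covers both success and failure; error_result is None on success.
    # Assumption: a failed job still returns 2XX, because retrying the check
    # would not change the outcome.
    return {"job_id": job_id, "state": job.state, "error": job.error_result}, 200
```

In a Flask view, the first function would replace the `load_job.result()` call, and the returned job ID would go into the task payload.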

You have to set up your Cloud Tasks queue with a retry policy that does not retry immediately (for example, set the min-backoff to 30s).
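
To get a feel for how often the check would run, the sketch below approximates the retry schedule as I understand it from the Cloud Tasks documentation: the interval starts at the minimum backoff, doubles `max_doublings` times, then grows linearly, and is capped at the maximum backoff. The helper name `retry_intervals` and the sample values other than the 30s minimum are assumptions chosen for illustration. It makes no API calls.

```python
def retry_intervals(min_backoff, max_backoff, max_doublings, count):
    """Approximate the first `count` retry intervals (in seconds) for a queue."""
    intervals = []
    interval = min_backoff
    for attempt in range(count):
        # The interval never exceeds max_backoff.
        intervals.append(min(interval, max_backoff))
        if attempt < max_doublings:
            interval *= 2  # doubling phase
        else:
            interval += min_backoff * 2 ** max_doublings  # linear phase
    return intervals


# With a 30s minimum, a 300s cap and 3 doublings (illustrative values):
print(retry_intervals(30, 300, 3, 6))  # [30, 60, 120, 240, 300, 300]
```

A long-running load job is therefore polled less and less frequently, up to the cap.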

Upvotes: 1
