Reputation: 817
I run a Google BigQuery batch query in a Python script in a Jupyter notebook. Usually, the query takes about an hour when run in interactive mode.
This morning I checked, and the script still showed the job in state RUNNING - after 16 hours. So I checked INFORMATION_SCHEMA.JOBS, and there it says the job is already in state DONE: there was no error in the execution and the query took about an hour. (I have another one still "running" in Python whose status, when I checked INFORMATION_SCHEMA.JOBS, showed an error.)
So I interrupted the kernel and checked: the dataframe where I store the results is filled, so I already have the results, but the state still shows RUNNING.
After I explicitly asked for the job again with:
query_job_test = client.get_job(
    'my_job_id', location='my_location'
)
I got the correct state DONE.
What did I do wrong? How can I prevent my script from being stuck even when the job is already done?
See below for my code snippets:
Investigating INFORMATION_SCHEMA.JOBS:
SELECT
  *
FROM
  `my_project_id`.`region-my_region`.INFORMATION_SCHEMA.JOBS
WHERE
  job_id = 'my_job_id'
Python script to run the batch query:
import time

from google.cloud import bigquery
from google.oauth2 import service_account

key_path = "../path_to_my_credentials.json"
credentials = service_account.Credentials.from_service_account_file(
    key_path, scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

job_config = bigquery.QueryJobConfig(
    priority=bigquery.QueryPriority.BATCH
)
query = """ SELECT * from my_table """

# callback that stores the query results in a global dataframe
def set_df(future):
    global df
    df = future.to_dataframe()

query_job = client.query(query, job_config=job_config)
query_job.add_done_callback(set_df)

# fetch the job once to inspect its state
query_job = client.get_job(
    query_job.job_id, location=query_job.location
)
while query_job.state != 'DONE':
    time.sleep(60)

print(df.head())
UPDATE: As a workaround I changed my script to:
def set_df(future):
    global df_all
    global is_done
    df_all = future.to_dataframe()
    # set the flag only after the dataframe is ready, so the main
    # thread cannot wake up before df_all exists
    is_done = True

while 'is_done' not in globals():
    time.sleep(60)
del is_done
print(df_all.head())
But still, for all my longer queries I run into the same issue with the state of the job.
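(Aside: a less fragile way to hand the result from the callback to the main thread than checking globals() is a threading.Event. A minimal sketch reusing client, query and job_config from above; done_event is a name introduced here for illustration, not part of the BigQuery API:)
import threading

df_all = None
done_event = threading.Event()  # illustrative helper, not from the original script

def set_df(future):
    global df_all
    df_all = future.to_dataframe()
    done_event.set()  # signal only once the dataframe is fully built

query_job = client.query(query, job_config=job_config)
query_job.add_done_callback(set_df)

# wait() blocks until set() is called; the timeout is just a safety net
if done_event.wait(timeout=6 * 60 * 60):
    print(df_all.head())
else:
    raise TimeoutError("BigQuery job did not finish within 6 hours")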
Upvotes: 0
Views: 1121
Reputation: 98
Adding to @Ingar Pedersen's answer, it's good to know that although the state returns 'DONE' for a finished job, the job can still have errors. So taking that into account:
while query_job.state != 'DONE':
    time.sleep(60)
    if query_job.errors is not None:
        raise Exception("BigQuery job failed with error {}".format(query_job.errors))
    query_job = client.get_job(query_job.job_id, location=query_job.location)
https://cloud.google.com/bigquery/docs/managing-jobs#bq
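As a side note, QueryJob.result() combines the waiting and the error propagation in one call: it blocks until the job finishes and raises google.cloud.exceptions.GoogleCloudError if the job failed, so the manual loop can often be dropped. A minimal sketch, reusing query_job from above:
# result() polls until the job is DONE and raises if the job failed
rows = query_job.result()
df = rows.to_dataframe()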
Upvotes: 1
Reputation: 38
You are not updating the job in the while loop. Add a client.get_job inside the loop to get an updated state and it should work:
while query_job.state != 'DONE':
    time.sleep(60)
    query_job = client.get_job(
        query_job.job_id, location=query_job.location
    )
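Equivalently, QueryJob.reload() refreshes the state of the existing job object in place instead of fetching a new one; a sketch of the same loop:
while query_job.state != 'DONE':
    time.sleep(60)
    query_job.reload()  # re-fetches the job's current state from the API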
Upvotes: 1