Reputation: 27526
I am trying to get information on the number of queries running in BigQuery, and their states.
I've tried this:
select count(job_id) as job_count, state
from `MY_PROJECT`.`region-MY_REGION`.INFORMATION_SCHEMA.JOBS
where creation_time between
TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
AND CURRENT_TIMESTAMP()
and query not like 'select count(job_id) as job_count, state%'
and project_id = 'MY_PROJECT'
group by state;
but I only seem to get counts for queries submitted by my user. I know there is more activity than that because when I go to the monitoring page, the "Jobs Explorer", I can see jobs submitted by different service accounts as well, and how many are in each state. The fact that I can see these jobs in the UI makes me doubt that it is a permissions-related issue (unless queries require different permissions than the UI).
I need to be able to get that information programmatically instead of looking at the UI, and I am not sure why the query above doesn't do it.
I thought I would try using the bigquery client, from Python:
from google.cloud import bigquery
from datetime import datetime, timedelta

project_id = 'MY_PROJECT'
client = bigquery.Client()
jobs = client.list_jobs(project=project_id,
                        all_users=True,
                        min_creation_time=datetime.now() - timedelta(hours=1),
                        max_creation_time=datetime.now())
states = []
job_counts = {}

# Iterate through the jobs and count the states
for job in jobs:
    if job.state not in states:
        states.append(job.state)
        job_counts[job.state] = 0
    job_counts[job.state] += 1

for state, count in job_counts.items():
    print(f'Jobs in state {state}: {count}')
And this does not give the same results as the SQL query above.
What is the best way to get the number of jobs and their states in BigQuery, programmatically?
Upvotes: 0
Views: 110
Reputation: 27526
For anyone who was following, the code solution worked. What I finally used was this:
import logging
from datetime import datetime, timedelta

from google.cloud import bigquery


def get_job_statuses(project_id: str, region: str):
    client = bigquery.Client()
    jobs = client.list_jobs(project=project_id,
                            all_users=True,
                            min_creation_time=datetime.now() - timedelta(minutes=5),
                            max_creation_time=datetime.now())
    states = []
    job_counts_by_user = {}
    job_counts = {}
    job_priorities = {}
    for job in jobs:
        # Count jobs per state (PENDING, RUNNING, DONE)
        if job.state not in states:
            states.append(job.state)
            job_counts[job.state] = 0
        job_counts[job.state] += 1
        # Count jobs per priority (INTERACTIVE / BATCH for query jobs)
        if job.priority not in job_priorities:
            job_priorities[job.priority] = 0
        job_priorities[job.priority] += 1
        # Count jobs per submitting user or service account
        if job.user_email not in job_counts_by_user:
            job_counts_by_user[job.user_email] = 0
        job_counts_by_user[job.user_email] += 1
    logging.info(f"{job_counts}")
    logging.info(f"{job_priorities}")
    logging.info(f"{job_counts_by_user}")
    client.close()
    return job_counts
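For reference, here is a minimal way to call it; the project and region values are placeholders, not the real ones I use:

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    counts = get_job_statuses('MY_PROJECT', 'MY_REGION')
    print(counts)  # e.g. {'DONE': 12, 'RUNNING': 3, 'PENDING': 1}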
I think there was some subtle bug in the original API-based solution I posted; I just can't remember what it was.
Upvotes: 0
Reputation: 3505
The INFORMATION_SCHEMA.JOBS view only shows information for queries submitted by the current user. To see data for all users in your project, use the INFORMATION_SCHEMA.JOBS_BY_PROJECT view instead:
SELECT COUNT(job_id) AS job_count, state
FROM `MY_PROJECT`.`region-MY_REGION`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
AND CURRENT_TIMESTAMP()
AND query NOT LIKE 'SELECT COUNT(job_id) AS job_count, state%'
AND project_id = 'MY_PROJECT'
GROUP BY state;
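If you need the result programmatically rather than in the console, a minimal sketch of running that same query through the Python client (MY_PROJECT and MY_REGION are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT COUNT(job_id) AS job_count, state
FROM `MY_PROJECT`.`region-MY_REGION`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
      AND CURRENT_TIMESTAMP()
  AND project_id = 'MY_PROJECT'
GROUP BY state
"""
# Run the query and print one line per job state
for row in client.query(sql).result():
    print(row.state, row.job_count)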
Upvotes: -1