Reputation: 12808
I want to get a better grip on costs in GCP, and one way of doing that is to categorize / classify all BigQuery jobs and queries into groups, for example by the application or user that fired the query.
This would help me see where most of my query costs are coming from.
It could be a BI tool like Tableau, scheduled queries, or a user running very large queries.
To get that info I am using:
SELECT *
FROM `your_project_id.region-eu.INFORMATION_SCHEMA.JOBS`
The labels field in there is not really informative, so how can I recognize where query jobs originated from?
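For reference, a narrowed-down version of that query with just the columns that look relevant for cost attribution (a sketch; total_bytes_billed is what drives on-demand cost, and the project id / region are placeholders):

SELECT
  job_id,
  user_email,
  job_type,
  query,
  labels,
  total_bytes_billed
FROM `your_project_id.region-eu.INFORMATION_SCHEMA.JOBS`
WHERE creation_time >= timestamp_sub(current_timestamp(), interval 30 day)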
Upvotes: 3
Views: 2047
Reputation: 12808
You can categorize queries using the job_id, user_email and job_type fields.
A starting point for classifying jobs based on these fields is the CASE expression below:
case
-- typical bq jobs
when starts_with(job_id, 'materialized_view_refresh_') then 'materialized_view_refresh' -- this is run by: [email protected]
when starts_with(job_id, 'scheduled_query_') then 'scheduled_query'
when starts_with(job_id, 'script_job_') then 'script_job' -- every script job has a parent job
when starts_with(job_id, 'bquxjob_') then 'bq user interface' -- jobs run in the BigQuery web UI
when starts_with(job_id, 'bqjob_') then 'bigquery load or copy job - somewhat unclear' -- I see some (manual?) copy and load jobs that get this prefix
when starts_with(job_id, 'dataform-gcp') then 'Dataform job'
when starts_with(job_id, 'clouddq-') then 'Cloud Composer job'
when starts_with(job_id, 'job_') AND job_type = 'EXTRACT' then 'BQ UI data export to Google Drive' -- this is an educated guess; note that most job ids that start with 'job_' are made by Tableau or Looker Studio
-- data transfers -- are they all using gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com?
when starts_with(job_id, 'adwords_') then 'adwords' -- Adwords job that runs automatically in Data Transfers
when starts_with(job_id, 'google_ads_') then 'google_ads' -- Google Ads job that runs automatically in Data Transfers
when starts_with(job_id, 'merchant_center_') then 'merchant_center' -- Merchant Center job that runs automatically in Data Transfers
when starts_with(job_id, '123456-0000-1234-baa9-1234463245') then 'billing data transfer' -- this id refers to your datasource of the billing data / logs
when starts_with(job_id, 'bqts_') then 'Dataset Copy in Data Transfers' -- so within Data Transfers you've created a Dataset Copy task
-- google analytics exports, account numbers are for example 1234567 and 7654321
-- for GA4, however, the account number (e.g. 222222) might be inside the job_id when there's another prefix
when starts_with(job_id, 'your ga account number') then 'google analytics' -- this is run by: [email protected]
when starts_with(job_id, 'some_number_and_then_the_word_analytics') or contains_substr(job_id, 'your_account_number') then 'google analytics' -- this is run by: [email protected]
-- 3rd party jobs that use a job_prefix
when starts_with(job_id, 'sheets_dataconnector_') then 'sheets_dataconnector' -- jobs run from google sheets dataconnector
when starts_with(job_id, 'job-exponea') then 'job-exponea' -- jobs run from Exponea
when starts_with(job_id, 'job_') then 'tableau or lookerstudio job' -- Tableau and Looker Studio seem to use job_ as a prefix
-- recognize job based on other characteristics - when no job prefix is used
when starts_with(query, '/* {"app":"dbt"') then 'dbt job' -- dbt cloud jobs can be recognized by this comment added to the query
when user_email = '[email protected]' then 'job based on a certain user email'
-- programmatic jobs without prefix, run for example with python or notebooks or cloud functions using the bq_client.
-- jobs have a pattern like this: 11cd23fa-1bc5-3fz3-a22d-f1a918660f1a
-- putting this at the end as a fallback, when no other categorization matches
when regexp_contains(job_id, '^[a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12}$') then 'python bq client jobs'
-- everything else, for example python jobs that use a different job_id format
else 'unknown'
end as job_classification
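To turn this into a cost overview, the CASE expression can be wrapped in a query that aggregates total_bytes_billed per category. A minimal sketch, with the CASE abbreviated to a few branches (paste in the full version from above) and an assumed on-demand rate of roughly 6.25 USD per TiB billed (substitute your region's rate):

SELECT
  case
    -- abbreviated: extend with the full CASE expression above
    when starts_with(job_id, 'scheduled_query_') then 'scheduled_query'
    when starts_with(job_id, 'bquxjob_') then 'bq user interface'
    when starts_with(job_id, 'job_') then 'tableau or lookerstudio job'
    else 'unknown'
  end as job_classification,
  count(*) as jobs,
  round(sum(total_bytes_billed) / pow(1024, 4), 2) as tib_billed,
  round(sum(total_bytes_billed) / pow(1024, 4) * 6.25, 2) as approx_cost_usd -- assumed on-demand rate, check your region
FROM `your_project_id.region-eu.INFORMATION_SCHEMA.JOBS`
WHERE creation_time >= timestamp_sub(current_timestamp(), interval 30 day)
GROUP BY job_classification
ORDER BY approx_cost_usd DESC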
Upvotes: 2