ASH

Reputation: 20352

How to list sizes of all tables in all datasets in Google BigQuery

I'm trying to figure out how to list the sizes of all tables in all projects in Google BigQuery. Maybe it would be a SQL union of multiple tables. Although I'm looking at A LOT of tables here, so I'd like some kind of automated solution. I could use R code to perform this task, or I could even use Python to do it. If anyone here has a solution to list some metrics, primarily the size of each object (table), as well as other relevant metrics, please share it here. Thanks so much!

Upvotes: 2

Views: 6294

Answers (5)

Thiago Silva

Reputation: 36

What you want is not directly supported in BigQuery, but you can try this workaround:

DECLARE datasets ARRAY<STRING>;
DECLARE i INT64 DEFAULT 0;
DECLARE datasets_length INT64;

SET datasets = (SELECT ARRAY_AGG(schema_name) FROM `<project>.INFORMATION_SCHEMA.SCHEMATA`);
SET datasets_length = (SELECT ARRAY_LENGTH(datasets));

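-- Loop over each dataset and upsert its __TABLES__ metadata into the control table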
WHILE i < datasets_length DO
  EXECUTE IMMEDIATE 
  FORMAT(
    CONCAT(
      "MERGE `<project>.control.control_load` T ",
      "USING `<project>.%s.__TABLES__` S ",
      "ON T.dataset_id = S.dataset_id AND T.table_id = S.table_id AND T.last_modified_time < S.last_modified_time ",
      "WHEN MATCHED ",
        "THEN UPDATE SET last_modified_time = S.last_modified_time, row_count = S.row_count, size_bytes = S.size_bytes ",
      "WHEN NOT MATCHED ",
       "THEN INSERT ",
        "(",
          "project_id,",
          "dataset_id,",
          "table_id,",
          "creation_time,",
          "last_modified_time,",
          "row_count,",
          "size_bytes,",
          "type",
        ") ",
        "VALUES ",
        "(",
          "project_id,",
          "dataset_id,",
          "table_id,",
          "creation_time,",
          "last_modified_time,",
          "row_count,",
          "size_bytes,",
          "type",
        ")"), datasets[OFFSET(i)]);
  SET i = i + 1;
END WHILE;

Basically, what I did was store all the datasets in the datasets array and its length in datasets_length.

After that, I created a WHILE loop that runs through the list of datasets, using its index i to substitute the dataset name into the query via FORMAT and CONCAT, and then merges all the data into the control_load table.

The query took around 20 minutes to finish the first time, but it not only stores the table info for all datasets, it also tracks any changes to those tables. I recommend using it with scheduled queries, so you can run it daily automatically.
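
Note that the MERGE above expects the control.control_load target table to already exist. Here is a minimal sketch of creating it with the Python client, assuming its columns simply mirror what __TABLES__ exposes (the exact column types and the control dataset name are assumptions):

# Minimal sketch: create the control_load target table with columns mirroring __TABLES__.
from google.cloud import bigquery

client = bigquery.Client(project="<project>")  # replace <project> with your project id
schema = [
    bigquery.SchemaField("project_id", "STRING"),
    bigquery.SchemaField("dataset_id", "STRING"),
    bigquery.SchemaField("table_id", "STRING"),
    bigquery.SchemaField("creation_time", "INT64"),
    bigquery.SchemaField("last_modified_time", "INT64"),
    bigquery.SchemaField("row_count", "INT64"),
    bigquery.SchemaField("size_bytes", "INT64"),
    bigquery.SchemaField("type", "INT64"),
]
client.create_dataset("control", exists_ok=True)  # no-op if the dataset already exists
table = bigquery.Table("{}.control.control_load".format(client.project), schema=schema)
client.create_table(table, exists_ok=True)        # no-op if the table already exists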

Hope it works for you like it worked for me :)

Upvotes: 0

Vadiraj k.s

Reputation: 59

In case anyone needs this through bash:

#!/bin/bash
project_name="abc-project-name"
echo -e "project_id,dataset_id,table_id,row_count,size_mb,size_gb,type,partiton,partition_expiration_days,cluster_key" > /tmp/bq_out.csv
for dataset in $(bq ls|tail -n +3); do
bq query --format=csv --use_legacy_sql=false '
SELECT
  t1.project_id as project_id,
  t1.dataset_id as dataset_id ,
    t1.table_id as table_id,
    t1.row_count as row_count,
    round(safe_divide(t1.size_bytes, (1000*1000)),1) as size_mb,
    round(safe_divide(t1.size_bytes, (1000*1000*1000)),2) as size_gb,
    case
        when t1.type = 1 then "table"
        when t1.type = 2 then "view"
        when t1.type = 3 then "external"
        else "?"
     END AS type,
   case 
    when t2.ddl like "%PARTITION BY%" then "Yes" 
    else "No" 
  end as `partition`,
  REGEXP_EXTRACT(t2.ddl, r".*partition_expiration_days=([0-9-].*)") as partition_expiration_days,
  REGEXP_EXTRACT(t2.ddl, r"CLUSTER BY(.*)") as cluster_key,
FROM `'"${project_name}"'.'"${dataset}"'.__TABLES__` as t1,`'"${project_name}"'.'"${dataset}"'.INFORMATION_SCHEMA.TABLES` as t2
where t1.table_id=t2.table_name' | sed "1d" >> /tmp/bq_out.csv
done

Upvotes: 0

Nick Orlove

Reputation: 559

@enle lin I actually found a couple of issues with your code: it doesn't handle exceptions for projects that don't have the BigQuery API enabled, and it uses the wrong variable (name instead of projectId). So I tweaked your code and converted the bytes being pulled to GiB (which just seemed more relevant). Please see below:

from google.cloud import bigquery
from google.cloud.bigquery import Dataset
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

#Leverage the Application Default Credentials for authentication
credentials = GoogleCredentials.get_application_default()
service = discovery.build('cloudresourcemanager', 'v1', credentials=credentials)

#List projects
request = service.projects().list()
response = request.execute()

#Main loop to list projects
for project in response.get('projects', []):
  try:
    client = bigquery.Client(project['projectId']) # Start the client in the right project

    #Loop to list datasets
    datasets = list(client.list_datasets())
    if datasets: # If there is some BQ dataset
        print('Datasets in project {}:'.format(project['projectId']))
        #Loop to list the tables in each dataset
        for dataset in datasets:
            print(' - {}'.format(dataset.dataset_id))
            get_sizeGiB = client.query("select table_id, (size_bytes /1073741824) as sizeGiB from "+dataset.dataset_id+".__TABLES__") # This query retrieves all the tables in the dataset and the size in GiB. It can be modified to pull more fields.
            tables = get_sizeGiB.result() # Get the result
            #Loop to list the tables and print the size
            for table in tables:
                print('\t{} sizeGiB: {}'.format(table.table_id,table.sizeGiB))
    else: print('{} project does not contain any datasets.'.format(project['projectId']))
  except Exception:
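    # Skip projects where the BigQuery API is not enabled or datasets cannot be listed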
    pass

Upvotes: 2

Tamir Klein

Reputation: 3642

Option 1

The current option as of today is to use the Google API to fetch project/dataset/table information and store it in a local table. Since you mentioned you have a lot of datasets and tables, I suggest a serverless approach to give your process scalability and speed (a minimal Python sketch follows the list below):

List Project

List dataset

List Table
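
A minimal Python sketch of that walk (projects, then datasets, then tables), assuming Application Default Credentials and the google-cloud-bigquery library; it illustrates the listing calls only, not the serverless deployment, and reads each table's size from the API metadata rather than from __TABLES__:

# Minimal sketch: walk projects -> datasets -> tables and print each table's size.
from google.cloud import bigquery

client = bigquery.Client()
for project in client.list_projects():  # projects visible to these credentials in BigQuery
    for dataset in client.list_datasets(project.project_id):
        for table_item in client.list_tables(dataset.reference):
            table = client.get_table(table_item.reference)  # fetch the full table metadata
            print(project.project_id, dataset.dataset_id, table.table_id, table.num_bytes)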

Option 2

BigQuery now offers access to INFORMATION_SCHEMA in its beta program. Check it out; it might save you time and effort:

select * from `DATASET.INFORMATION_SCHEMA.TABLES`

Or

select * from `DATASET.INFORMATION_SCHEMA.COLUMNS`


Option 3

You can query __TABLES__ in each dataset to get table information:

select * from `project.dataset.__TABLES__`


Upvotes: -1

enle lin

Reputation: 1714

This example in Python lists all the tables and their sizes in bytes across all the projects. You can take it as an example to build a script that fits your use case:

from google.cloud import bigquery
from google.cloud.bigquery import Dataset
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

# credentials to list project
credentials = GoogleCredentials.get_application_default()
service = discovery.build('cloudresourcemanager', 'v1', credentials=credentials)

# list project
request = service.projects().list()
response = request.execute()

# Main loop for project
for project in response.get('projects', []):
    client = bigquery.Client(project['projectId']) # Start the client in the right project

    # list dataset
    datasets = list(client.list_datasets())
    if datasets: # If there is some BQ dataset
        print('Datasets in project {}:'.format(project['name']))
        # Second loop to list the tables in the dataset
        for dataset in datasets: 
            print(' - {}'.format(dataset.dataset_id))
            get_size = client.query("select table_id, size_bytes as size from "+dataset.dataset_id+".__TABLES__") # This query retrieves all the tables in the dataset and their size in bytes. It can be modified to get more fields.
            tables = get_size.result() # Get the result
            # Third loop to list the tables and print the result
            for table in tables:
                print('\t{} size: {}'.format(table.table_id,table.size))

Reference:

To list the projects:
https://cloud.google.com/resource-manager/reference/rest/v1/projects/list#embedded-explorer

To list the datasets:
https://cloud.google.com/bigquery/docs/datasets#bigquery-list-datasets-python

Upvotes: 6
