Zusman

Reputation: 666

BigQuery: get table sizes from all datasets

I have a simple query that returns the size of each table in the dataset orders:

SELECT
  table_id,
  TRUNC(size_bytes/1024/1024/1024/1024, 2) AS size_tb
FROM orders.__TABLES__

If I want to run this query once for the whole project and all its tables, how can I do it? I tried changing the last line to FROM __TABLES__, but that raises an error.

Upvotes: 2

Views: 1692

Answers (3)

OK, let's do this in a few steps:

Step 1 - list the datasets of a single project:

-- $_PROJECT_ID is a placeholder for your project ID
SELECT
    STRING_AGG(CONCAT("SELECT * FROM `$_PROJECT_ID.", schema_name, ".__TABLES__` "), " UNION ALL \n")
FROM
    `$_PROJECT_ID`.INFORMATION_SCHEMA.SCHEMATA;

Or, if it isn't just a single project:

Step 1.1 - list all projects that show up in the organization's query history over the last 6 months (180 days):

WITH LISTA_PROJETOS AS (
  -- every project referenced by a job in the last 180 days of history
  SELECT DISTINCT R.PROJECT_ID
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION J, UNNEST(REFERENCED_TABLES) R
  ORDER BY 1 ASC
), RESULTADOS AS (
  -- for each project, build the generator query from step 1, then glue them together
  SELECT 'SELECT \n\t' || AGG_RESULTADOS FROM (
    SELECT STRING_AGG('(SELECT STRING_AGG(CONCAT("SELECT * FROM `'||PROJECT_ID||'.", SCHEMA_NAME, ".__TABLES__` UNION ALL "), "\\n") FROM `'||PROJECT_ID||'`.INFORMATION_SCHEMA.SCHEMATA)', ' ||"\\n"||\n\t') AS AGG_RESULTADOS
    FROM LISTA_PROJETOS
  )
)
SELECT * FROM RESULTADOS;
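
For illustration, with two hypothetical projects proj-a and proj-b in the history, the one-line output of step 1.1 is itself a query generator along these lines:

SELECT
	(SELECT STRING_AGG(CONCAT("SELECT * FROM `proj-a.", SCHEMA_NAME, ".__TABLES__` UNION ALL "), "\n") FROM `proj-a`.INFORMATION_SCHEMA.SCHEMATA) ||"\n"||
	(SELECT STRING_AGG(CONCAT("SELECT * FROM `proj-b.", SCHEMA_NAME, ".__TABLES__` UNION ALL "), "\n") FROM `proj-b`.INFORMATION_SCHEMA.SCHEMATA)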

If you chose step 1.1, copy its one-line output to the clipboard and execute it.

You will end up with something like this:

SELECT * FROM `teste.raw.__TABLES__` UNION ALL 
SELECT * FROM `teste.stage.__TABLES__` UNION ALL

Take care: the maximum number of unioned SELECTs for this query is 100, and you must remove the trailing UNION ALL from the last line for the query to work; one way to automate that is sketched below.
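
If you prefer not to edit the pasted text by hand, a BigQuery scripting sketch can trim the trailing keyword for you (my addition, not part of the original steps; the sample statement is hypothetical):

DECLARE generated_sql STRING;

-- hypothetical output of step 1.1, still carrying the unwanted trailing UNION ALL
SET generated_sql = 'SELECT * FROM `teste.raw.__TABLES__` UNION ALL \nSELECT * FROM `teste.stage.__TABLES__` UNION ALL';

-- strip a trailing "UNION ALL" (plus surrounding whitespace), if present
SET generated_sql = REGEXP_REPLACE(generated_sql, r'\s*UNION ALL\s*$', '');

SELECT generated_sql; -- now safe to paste into step 2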

Then you should do the next step:

Step 2:

/***** The query that produces the final report *****/
SELECT
  project_id,
  dataset_id,
  table_id,
  concat(project_id,':',dataset_id,'.',table_id) objeto,
  case type
    when 1 then 'TABLE'
    when 2 then 'VIEW'
    else 'OTHER' 
  end as tipo,
  row_count as qtd_linhas,
  round(size_bytes/power(1024, 3), 2) as tamanho_gb,
  FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', TIMESTAMP_MILLIS(creation_time), 'America/Sao_Paulo') as data_criacao,
  FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', TIMESTAMP_MILLIS(last_modified_time), 'America/Sao_Paulo') as ultima_modificacao, /* job history covers only the last 6 months (GCP) */
  FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', MAX(last_query_in), 'America/Sao_Paulo') as ultima_consulta_em,
  MAX(user_email) as consultado_por
FROM (
  /***** HERE YOU SHOULD PASTE THE CODE OUTPUT FROM STEP 1 OR 1.1 *****/
SELECT * FROM `teste.raw.__TABLES__` UNION ALL 
SELECT * FROM `teste.stage.__TABLES__`
  /***** HERE YOU SHOULD PASTE THE CODE OUTPUT FROM STEP 1 OR 1.1 *****/
) AS tables
LEFT JOIN (
  SELECT
    creation_time AS last_query_in,
    user_email,
    x
  FROM
    `region-us`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION,
    UNNEST(referenced_tables) AS x) AS jobs
ON
  tables.project_id = x.project_id
  AND tables.dataset_id = x.dataset_id
  AND tables.table_id = x.table_id
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9 ORDER BY 2, 7

Finally, you have the data you wanted. Let me know if this helps.
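
As an optional refinement of my own (not part of the original steps): BigQuery scripting with EXECUTE IMMEDIATE can skip the clipboard round trip entirely for the single-project case. A minimal sketch, reusing the $_PROJECT_ID placeholder from step 1 and a simplified final SELECT:

-- build the UNION ALL over all datasets of one project and run it in the same script
DECLARE union_sql STRING;

SET union_sql = (
  SELECT STRING_AGG(
           CONCAT('SELECT * FROM `$_PROJECT_ID.', schema_name, '.__TABLES__`'),
           ' UNION ALL \n')
  FROM `$_PROJECT_ID`.INFORMATION_SCHEMA.SCHEMATA
);

-- kept minimal here; the full report from step 2 can be wrapped the same way
EXECUTE IMMEDIATE FORMAT("""
  SELECT project_id, dataset_id, table_id,
         ROUND(size_bytes / POW(1024, 3), 2) AS size_gb
  FROM (%s)
  ORDER BY size_gb DESC
""", union_sql);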

Upvotes: 0

Gefest

Reputation: 123

You could query the INFORMATION_SCHEMA.TABLE_STORAGE view:

SELECT
  project_id,
  table_schema,
  table_name,
  SUM(total_physical_bytes) / POW(10, 9) AS size
FROM
  project.region.INFORMATION_SCHEMA.TABLE_STORAGE
GROUP BY 1, 2, 3
ORDER BY size DESC

where project is your project name and region is the region where the data is located (e.g. region-us). Note that hyphenated region names must be wrapped in backquotes, as in the example below. Refer to https://cloud.google.com/bigquery/docs/information-schema-table-storage for more info.
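
For example, for a hypothetical project my-project with data in the US multi-region:

SELECT
  project_id,
  table_schema,
  table_name,
  SUM(total_physical_bytes) / POW(10, 9) AS size_gb
FROM
  `my-project`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
GROUP BY 1, 2, 3
ORDER BY size_gb DESC;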

Upvotes: 2

y0j0

Reputation: 3592

I use this Python script for something similar (it probably originated on Stack Overflow), with my adjustments:

from google.cloud import bigquery

client = bigquery.Client()

datasets = list(client.list_datasets())
project = client.project

sizes = []

if datasets:
    print('Datasets in project {}:'.format(project))
    for dataset in datasets:  # API request(s)
        print('Dataset: {}'.format(dataset.dataset_id))

        query_job = client.query(
            "select table_id, sum(size_bytes)/pow(10,9) as size "
            "from `" + dataset.dataset_id + "`.__TABLES__ group by 1"
        )

        results = query_job.result()
        for row in results:
            print("\tTable: {} : {}".format(row.table_id, row.size))
            
            item = {
                'project': project,
                'dataset': dataset.dataset_id,
                'table': row.table_id,
                'size': row.size
            }
            sizes.append(item)

else:
    print('{} project does not contain any datasets.'.format(project))

Upvotes: 2
