Reputation: 666
I have a simple query that returns the tabels' sizes for each table in the dataset orders
:
SELECT
table_id,
TRUNC(size_bytes/1024/1024/1024/1024,2) size_tb,
FROM orders.__TABLES__
If I wish to run this query once for the whole project and all its tables, how can I do it?
I tried to change the last row to From __TABLES__
but that is an error.
Upvotes: 2
Views: 1692
Reputation: 1
Ok. Lets consider doing it in some steps:
Step 1 - List a single project and own datasets:
SELECT
string_agg(concat("SELECT * FROM `$_PROJECT_ID.", schema_name, ".__TABLES__` ")," UNION ALL \n")
FROM
`$_PROJECT_ID`.INFORMATION_SCHEMA.SCHEMATA;
OR IF ISNT FOR A SINGLE PROJECT:
Step 1.1 - List All projects consider then are been used in queries stories in last 6m (180 days):
WITH LISTA_PROJETOS AS (
SELECT DISTINCT R.PROJECT_ID
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION J, UNNEST(REFERENCED_TABLES) R
ORDER BY 1 ASC
), RESULTADOS AS (
SELECT 'SELECT \n\t' ||AGG_RESULTADOS FROM (
SELECT STRING_AGG('(SELECT STRING_AGG(CONCAT("SELECT * FROM `'||PROJECT_ID||'.", SCHEMA_NAME, ".__TABLES__` UNION ALL "), "\\n") FROM `'||PROJECT_ID||'`.INFORMATION_SCHEMA.SCHEMATA)', ' ||"\\n"||\n\t') AS AGG_RESULTADOS
FROM LISTA_PROJETOS
)
)
SELECT * FROM RESULTADOS;
If you choose the step 1.1 then you must copy all to clipboard the one line output from step 1.1 and execute it.
So you will have something like it:
SELECT * FROM `teste.raw.__TABLES__` UNION ALL
SELECT * FROM `teste.stage.__TABLES__` UNION ALL
Take care... the maximum list of unions for this query is 100. You must remove the last UNION ALL from last query for it works.
Then you should do the next step:
Step 2:
/***** Query onde será feita a consulta... *****/
SELECT
project_id,
dataset_id,
table_id,
concat(project_id,':',dataset_id,'.',table_id) objeto,
case type
when 1 then 'TABLE'
when 2 then 'VIEW'
else 'OTHER'
end as tipo,
row_count as qtd_linhas,
round(size_bytes/power(1024, 3), 2) as tamanho_gb,
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', TIMESTAMP_MILLIS(creation_time), 'America/Sao_Paulo') as data_criacao,
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', TIMESTAMP_MILLIS(last_modified_time), 'America/Sao_Paulo') as ultima_modificacao, /*Dados somente L6M (GCP)*/
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', MAX(last_query_in), 'America/Sao_Paulo') as ultima_consulta_em,
MAX(user_email) as consultado_por
FROM (
/***** HERE YOU SHOULD PASTE THE CODE OUTPUT FROM STEP 1 OR 1.1 *****/
SELECT * FROM `teste.raw.__TABLES__` UNION ALL
SELECT * FROM `teste.stage.__TABLES__`
/***** HERE YOU SHOULD PASTE THE CODE OUTPUT FROM STEP 1 OR 1.1 *****/
) AS tables
LEFT JOIN (
SELECT
creation_time AS last_query_in, user_email,
x
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION,
UNNEST(referenced_tables) AS x)
ON
project_id=x.project_id
AND x.dataset_id=dataset_id
AND x.table_id=table_id
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9 ORDER BY 2, 7
Finally you have the data you've desired. Let me know if this helps you, ok?
Upvotes: 0
Reputation: 123
You could use INFORMATION_SCHEMA
data to query
select
project_id,
TABLE_SCHEMA,
TABLE_NAME,
sum(TOTAL_PHYSICAL_BYTES) / pow(10,9) as size
from
project.region.INFORMATION_SCHEMA.TABLE_STORAGE
group by 1,2, 3
order by size DESC
Where project
is your project name and region
is region where data is located (e.g. region-us
). Refer to https://cloud.google.com/bigquery/docs/information-schema-table-storage for more info
Upvotes: 2
Reputation: 3592
I use this Python script for something similar (probably originate in Stackoverflow) with my adjustments
from google.cloud import bigquery
client = bigquery.Client()
datasets = list(client.list_datasets())
project = client.project
sizes = []
if datasets:
print('Datasets in project {}:'.format(project))
for dataset in datasets: # API request(s)
print('Dataset: {}'.format(dataset.dataset_id))
query_job = client.query("select table_id, sum(size_bytes)/pow(10,9) as size from `"+dataset.dataset_id+"`.__TABLES__ group by 1")
results = query_job.result()
for row in results:
print("\tTable: {} : {}".format(row.table_id, row.size))
item = {
'project': project,
'dataset': dataset.dataset_id,
'table': row.table_id,
'size': row.size
}
sizes.append(item)
else:
print('{} project does not contain any datasets.'.format(project))
Upvotes: 2