Reputation: 10857
Using BigQuery, is there a way I can select __TABLES__
from every dataset within my project? I've tried SELECT * FROM '*.__TABLES__'
but that is not allowed in BigQuery. Any help would be great, thanks!
Upvotes: 9
Views: 43436
Reputation: 83
You can query all tables in all datasets of a project with a single query. Note that this only works if all of the datasets are in a single region.
DECLARE table_list ARRAY<STRING>;
DECLARE query_tables STRING;

-- loop through all datasets
FOR var_dataset IN (
  SELECT
    catalog_name,
    schema_name
  FROM
    `${project_id}`.`region-${region}`.INFORMATION_SCHEMA.SCHEMATA
)
DO
  -- get all raw tables in the dataset
  SET query_tables = FORMAT(
    """SELECT ARRAY_AGG(table_id) FROM `%s.%s`.__TABLES__;""",
    var_dataset.catalog_name,
    var_dataset.schema_name
  );
  EXECUTE IMMEDIATE query_tables INTO table_list;

  -- loop through the tables inside the dataset
  FOR var_table IN (
    WITH expression AS (
      SELECT table_list
    )
    SELECT table_id FROM expression, UNNEST(table_list) AS table_id
  )
  DO
    -- do whatever you want with each table
    SELECT var_dataset.catalog_name, var_dataset.schema_name, var_table.table_id;
  END FOR;
END FOR;
This uses FOR...IN to loop through the datasets in the project, nested with another FOR...IN to loop through the tables.
Inside the first loop, EXECUTE IMMEDIATE fetches all of the tables in a dataset. Note that EXECUTE IMMEDIATE ... INTO only allows a single row to be saved into a variable, so the returned table names are collapsed onto one row with ARRAY_AGG.
That single row can then be turned into a "table_expression" using a WITH statement; a table expression is the type of parameter that FOR...IN can loop over.
I used this to dynamically create snapshots of all the tables in a project. One thing to consider is how many iterations such a loop can have. I read somewhere about a million, but I can't seem to find the number in the official documentation.
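For the snapshot use case, the innermost SELECT can be swapped for a dynamically built snapshot statement. A minimal sketch, assuming a pre-existing backup_dataset target and a 7-day expiration (both are my assumptions, not part of the original script):
-- a sketch: replace the inner SELECT with a snapshot statement
-- (the backup_dataset name and expiration are assumptions for illustration)
EXECUTE IMMEDIATE FORMAT(
  """CREATE SNAPSHOT TABLE IF NOT EXISTS `%s.backup_dataset.%s_snapshot`
  CLONE `%s.%s.%s`
  OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY));""",
  var_dataset.catalog_name,
  var_table.table_id,
  var_dataset.catalog_name,
  var_dataset.schema_name,
  var_table.table_id
);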
Upvotes: 0
Reputation: 59
Slight changes, referring to @Dinh Tran's answer:
#!/bin/bash
project_name="abc-project-name"
echo -e "project_id,dataset_id,table_id,creation_time,last_modified_time,row_count,size_mb,size_gb,type,partition,partition_expiration_days,cluster_key" > /tmp/bq_out.csv
for dataset in $(bq ls | tail -n +3); do
  bq query --format=csv --use_legacy_sql=false '
    SELECT
      t1.project_id AS project_id,
      t1.dataset_id AS dataset_id,
      t1.table_id AS table_id,
      TIMESTAMP_MILLIS(t1.creation_time) AS creation_time,
      TIMESTAMP_MILLIS(t1.last_modified_time) AS last_modified_time,
      t1.row_count AS row_count,
      ROUND(SAFE_DIVIDE(t1.size_bytes, (1000*1000)), 1) AS size_mb,
      ROUND(SAFE_DIVIDE(t1.size_bytes, (1000*1000*1000)), 2) AS size_gb,
      CASE
        WHEN t1.type = 1 THEN "table"
        WHEN t1.type = 2 THEN "view"
        WHEN t1.type = 3 THEN "external"
        ELSE "?"
      END AS type,
      CASE
        WHEN t2.ddl LIKE "%PARTITION BY%" THEN "Yes"
        ELSE "No"
      END AS `partition`,
      REGEXP_EXTRACT(t2.ddl, r".*partition_expiration_days=([0-9-].*)") AS partition_expiration_days,
      REGEXP_EXTRACT(t2.ddl, r"CLUSTER BY(.*)") AS cluster_key
    FROM `'"${project_name}"'.'"${dataset}"'.__TABLES__` AS t1,
         `'"${project_name}"'.'"${dataset}"'.INFORMATION_SCHEMA.TABLES` AS t2
    WHERE t1.table_id = t2.table_name' | sed "1d" >> /tmp/bq_out.csv
done
Upvotes: 0
Reputation: 332
I hit rate limits trying to use the accepted answer's SQL in BigQuery for hundreds of datasets with thousands of tables:
Exceeded rate limits: too many concurrent dataset meta table reads per project for this project. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas
and had to use the client library instead (Python 3.7):
requirements.txt:
google-cloud-bigquery==3.7.0
main.py:
from google.cloud import bigquery
# Replace with your project ID
project_id = "your-project-id"
# Create a BigQuery client object
client = bigquery.Client(project=project_id)
# List all datasets in the project
datasets = list(client.list_datasets())
# Iterate through each dataset and list its tables
for dataset in datasets:
    print("Dataset: {}".format(dataset.dataset_id))
    tables = list(client.list_tables(dataset))
    for table in tables:
        print("\t{}".format(table.table_id))
Upvotes: 0
Reputation: 477
Maybe you can use INFORMATION_SCHEMA instead of __TABLES__:
SELECT * FROM `region-us`.INFORMATION_SCHEMA.TABLES;
Just replace region-us with the region where your datasets live. If you have more than one region, you'll need to use UNION ALL, but it's still simpler than a UNION over every dataset.
Or you can use a query to generate all of the UNION ALL statements, like this (note that every generated line ends with UNION ALL, so remove the trailing one before running the output):
WITH SelectTable AS (
  SELECT
    1 AS ID,
    'SELECT * FROM ' || table_schema || '.__TABLES__ UNION ALL' AS SelectColumn
  FROM `region-us`.INFORMATION_SCHEMA.TABLES
  GROUP BY table_schema
)
SELECT STRING_AGG(SelectColumn, '\n')
FROM SelectTable
GROUP BY ID
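If you'd rather not paste by hand, here is a minimal sketch that avoids the trailing UNION ALL (by passing it as the STRING_AGG separator) and runs the result via EXECUTE IMMEDIATE. The union_sql variable name is mine, and this assumes dataset names that need no backquoting:
-- a sketch: build and execute the generated statement in one script
-- (assumes the generated string stays under the query size limit)
DECLARE union_sql STRING;
SET union_sql = (
  SELECT STRING_AGG('SELECT * FROM ' || table_schema || '.__TABLES__', ' UNION ALL \n')
  FROM (SELECT DISTINCT table_schema FROM `region-us`.INFORMATION_SCHEMA.TABLES)
);
EXECUTE IMMEDIATE union_sql;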
Upvotes: 1
Reputation: 31
You can extend Mikhail Berlyant's answer and automatically generate the SQL with a single script.
The INFORMATION_SCHEMA.SCHEMATA view lists all of the datasets, and a WHILE loop can generate all of the UNION ALL statements dynamically, like so:
DECLARE schemas ARRAY<STRING>;
DECLARE query STRING;
DECLARE i INT64 DEFAULT 0;
DECLARE arrSize INT64;
SET schemas = ARRAY(SELECT schema_name FROM <your_project>.INFORMATION_SCHEMA.SCHEMATA);
SET query = "SELECT * FROM (";
SET arrSize = ARRAY_LENGTH(schemas);
WHILE i < arrSize - 1 DO
  SET query = CONCAT(query, "SELECT '", schemas[OFFSET(i)], "', table_id, row_count, size_bytes FROM <your_project>.", schemas[OFFSET(i)], ".__TABLES__ UNION ALL ");
  SET i = i + 1;
END WHILE;
SET query = CONCAT(query, "SELECT '", schemas[ORDINAL(arrSize)], "', table_id, row_count, size_bytes FROM <your_project>.", schemas[ORDINAL(arrSize)], ".__TABLES__ )");
EXECUTE IMMEDIATE query;
Upvotes: 3
Reputation: 11
Building on @mikhail-berlyant's nice solution above, it's now possible to take advantage of BigQuery's scripting features to automate gathering the list of datasets and retrieving table metadata. Simply replace the *_name variables to generate a view of metadata for all of the tables in a given project.
DECLARE project_name STRING;
DECLARE dataset_name STRING;
DECLARE table_name STRING;
DECLARE view_name STRING;
DECLARE generate_metadata_query_for_all_datasets STRING;
DECLARE retrieve_table_metadata STRING;
DECLARE persist_table_metadata STRING;
DECLARE create_table_metadata_view STRING;
SET project_name = "your-project";
SET dataset_name = "your-dataset";
SET table_name = "your-table";
SET view_name = "your-view";
SET generate_metadata_query_for_all_datasets = CONCAT("SELECT STRING_AGG( CONCAT(\"select * from `",project_name,".\", schema_name, \".__TABLES__` \"), \"union all \\n\" ) AS datasets FROM `",project_name,"`.INFORMATION_SCHEMA.SCHEMATA");
SET retrieve_table_metadata = generate_metadata_query_for_all_datasets;
SET create_table_metadata_view = CONCAT(
"""
CREATE VIEW IF NOT EXISTS
`""",project_name,".",dataset_name,".",view_name,"""`
AS
SELECT
project_id
,dataset_id
,table_id
,DATE(TIMESTAMP_MILLIS(creation_time)) AS created_date
,TIMESTAMP_MILLIS(creation_time) AS created_at
,DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
,TIMESTAMP_MILLIS(last_modified_time) AS last_modified_at
,row_count
,size_bytes
,round(safe_divide(size_bytes, (1000*1000)),1) as size_mb
,round(safe_divide(size_bytes, (1000*1000*1000)),2) as size_gb
,CASE
WHEN type = 1 THEN 'native table'
WHEN type = 2 THEN 'view'
WHEN type = 3 THEN 'external table'
ELSE 'unknown'
END AS type
FROM `""",project_name,".",dataset_name,".",table_name,"""`
ORDER BY dataset_id, table_id asc""");
EXECUTE IMMEDIATE retrieve_table_metadata INTO persist_table_metadata;
EXECUTE IMMEDIATE CONCAT("CREATE OR REPLACE TABLE `",project_name,".",dataset_name,".",table_name,"` AS (",persist_table_metadata,")");
EXECUTE IMMEDIATE create_table_metadata_view;
After that you can query your new view.
SELECT * FROM `[PROJECT ID].[DATASET ID].[VIEW NAME]`
Upvotes: 1
Reputation: 15431
Mikhail Berlyant's answer is very good. I'd just like to add that there's a cleaner way for some cases.
If all of your tables live in a single dataset and follow a naming pattern, you can query them with a wildcard table.
Let's say you want to query the noaa_gsod dataset (its tables are named gsod1929, gsod1930, ..., gsod2018, gsod2019); then simply use
FROM
`bigquery-public-data.noaa_gsod.gsod*`
This matches every table in the noaa_gsod dataset whose name begins with the string gsod.
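For example, a sketch that counts rows per yearly table by filtering on _TABLE_SUFFIX (the year range here is arbitrary):
-- a sketch: row counts per matched table, filtered by suffix
SELECT
  _TABLE_SUFFIX AS year,
  COUNT(*) AS row_count
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX BETWEEN '1929' AND '1940'
GROUP BY year
ORDER BY year;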
Upvotes: -1
Reputation: 301
You can use this SQL query to generate the list of datasets in your project:
select string_agg(
concat("select * from `[PROJECT ID].", schema_name, ".__TABLES__` ")
, "union all \n"
)
from `[PROJECT ID]`.INFORMATION_SCHEMA.SCHEMATA;
You will get a list like this:
select * from `[PROJECT ID].[DATASET ID 1].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 2].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 3].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 4].__TABLES__`
...
Then put the list within this query:
SELECT
table_id
,DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date
,DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
,row_count
,size_bytes
,round(safe_divide(size_bytes, (1000*1000)),1) as size_mb
,round(safe_divide(size_bytes, (1000*1000*1000)),2) as size_gb
,CASE
WHEN type = 1 THEN 'table'
WHEN type = 2 THEN 'view'
WHEN type = 3 THEN 'external'
ELSE '?'
END AS type
,TIMESTAMP_MILLIS(creation_time) AS creation_time
,TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time
,FORMAT_TIMESTAMP("%Y-%m", TIMESTAMP_MILLIS(last_modified_time)) as last_modified_month
,dataset_id
,project_id
FROM
(
select * from `[PROJECT ID].[DATASET ID 1].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 2].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 3].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 4].__TABLES__`
)
ORDER BY dataset_id, table_id asc
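As a usage example, the same subquery makes quick reports easy, e.g. the ten largest tables in the project (a sketch; the inner UNION ALL list is abbreviated here):
SELECT
  dataset_id,
  table_id,
  ROUND(SAFE_DIVIDE(size_bytes, (1000*1000*1000)), 2) AS size_gb
FROM
(
  select * from `[PROJECT ID].[DATASET ID 1].__TABLES__` union all
  select * from `[PROJECT ID].[DATASET ID 2].__TABLES__`
)
ORDER BY size_bytes DESC
LIMIT 10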
Upvotes: 20
Reputation: 1301
I know that you asked about BigQuery itself, but I wrote a Python script to get this information; maybe it can help other coders:
Pip install:
!pip install google-cloud
!pip install google-api-python-client
!pip install oauth2client
Code:
import threading

from google.cloud import bigquery

def _worker_query(project, dataset_id, results_scan):
    # count the tables in one dataset and append the result to the shared list
    query_str = 'SELECT * FROM `{}.{}.__TABLES__`'.format(project, dataset_id)
    query_job = client.query(query_str)
    rows = query_job.result()
    count = 0
    for row in rows:
        count = count + 1
    results_scan.append({'dataset_id': dataset_id, 'count': count})

def main_execute():
    project = 'bigquery-public-data'
    datasets = client.list_datasets(project)
    threads_project = []
    results_scan = []
    # one thread per dataset, all appending to results_scan
    for d in datasets:
        t = threading.Thread(target=_worker_query, args=(project, d.dataset_id, results_scan))
        threads_project.append(t)
        t.start()
    for t in threads_project:
        t.join()
    total_count = 0
    for result in results_scan:
        print(result)
        total_count = total_count + result['count']
    print('\n\nTOTAL TABLES: "{}"'.format(total_count))

JSON_FILE_NAME = 'sa_bq.json'
client = bigquery.Client.from_service_account_json(JSON_FILE_NAME)
main_execute()
Upvotes: 6
Reputation: 172954
The __TABLES__ syntax is supported only for a specific dataset and does not work across datasets.
What you can do is something like below:
#standardSQL
WITH ALL__TABLES__ AS (
SELECT * FROM `bigquery-public-data.1000_genomes.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.baseball.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.bls.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.census_bureau_usa.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.cloud_storage_geo_index.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.cms_codes.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.common_us.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.fec.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.genomics_cannabis.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.ghcn_d.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.ghcn_m.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.github_repos.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.hacker_news.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.irs_990.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.medicare.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.new_york.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.nlm_rxnorm.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.noaa_gsod.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.open_images.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.samples.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.san_francisco.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.stackoverflow.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.usa_names.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.utility_us.__TABLES__`
)
SELECT *
FROM ALL__TABLES__
In this case you need to know the list of datasets in advance, which you can easily get via the Datasets: list API
or the respective bq ls command.
Please note: the above approach will only work for datasets with data in the same location. If you have datasets with data in different locations, you will need to query them in separate queries.
For example:
#standardSQL
WITH ALL_EU__TABLES__ AS (
SELECT * FROM `bigquery-public-data.common_eu.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.utility_eu.__TABLES__`
)
SELECT *
FROM ALL_EU__TABLES__
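To check which location each dataset lives in before splitting the queries, the INFORMATION_SCHEMA.SCHEMATA view exposes a location column. A sketch, assuming you have read access on the project (repeat per region qualifier):
-- a sketch: list datasets and their locations for one region qualifier
SELECT schema_name, location
FROM `bigquery-public-data`.`region-us`.INFORMATION_SCHEMA.SCHEMATA;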
Upvotes: 10