Joe Scotto

Reputation: 10857

BigQuery select __TABLES__ from all tables within project?

Using BigQuery, is there a way I can select __TABLES__ from every dataset within my project? I've tried SELECT * FROM '*.__TABLES' but that is not allowed within BigQuery. Any help would be great, thanks!

Upvotes: 9

Views: 43436

Answers (10)

wallstalk

Reputation: 83

You can query all tables in all datasets of a project with a single query. Note that this only works if all the tables are in a single region.

DECLARE table_list ARRAY<STRING>;
DECLARE query_tables STRING;

-- loop through all datasets
FOR var_dataset IN (
  SELECT
    catalog_name,
    schema_name
  FROM
    `${project_id}`.`region-${region}`.INFORMATION_SCHEMA.SCHEMATA
)
DO

  -- get all raw tables in the dataset
  SET query_tables = FORMAT(
    """SELECT ARRAY_AGG(table_id) FROM `%s.%s`.__TABLES__;""",
    var_dataset.catalog_name, 
    var_dataset.schema_name
    );
  EXECUTE IMMEDIATE query_tables INTO table_list;

  -- loop through the tables inside of a dataset
  FOR var_table IN (
    WITH expression AS (
      SELECT table_list
    )
    SELECT table_id FROM expression, UNNEST(table_list) AS table_id
  )
  DO
    --- do whatever you want with each table
    SELECT var_dataset.catalog_name, var_dataset.schema_name, var_table.table_id;
  END FOR;
  
END FOR;
    

The script uses FOR...IN to loop through the datasets in the project, nesting another FOR...IN to loop through the tables.

Inside the first loop, EXECUTE IMMEDIATE ... INTO fetches all of the tables of a dataset. Note that it only allows a single row to be saved into a variable, so ARRAY_AGG collapses all of the returned table names into one row.

Then that single row can be transformed into a "table_expression" using a WITH statement; a "table_expression" is the type of parameter that FOR...IN can loop through.

I used this to dynamically create snapshots of all the tables in a project. One thing to consider is how many iterations such a loop can have; I read somewhere that the limit is about a million, but I can't find the number in the official documentation.

Upvotes: 0

Vadiraj k.s

Reputation: 59

Slight changes, referring to @Dinh Tran's answer:

#!/bin/bash
project_name="abc-project-name"
echo -e "project_id,dataset_id,table_id,creation_time,last_modified_time,row_count,size_mb,size_gb,type,partition,partition_expiration_days,cluster_key" > /tmp/bq_out.csv
for dataset in $(bq ls|tail -n +3); do
bq query --format=csv --use_legacy_sql=false '
SELECT
  t1.project_id as project_id,
  t1.dataset_id as dataset_id ,
  t1.table_id as table_id,
  TIMESTAMP_MILLIS(t1.creation_time) AS creation_time,
  TIMESTAMP_MILLIS(t1.last_modified_time) AS last_modified_time,
  t1.row_count as row_count,
  round(safe_divide(t1.size_bytes, (1000*1000)),1) as size_mb,
  round(safe_divide(t1.size_bytes, (1000*1000*1000)),2) as size_gb,
  case
        when t1.type = 1 then "table"
        when t1.type = 2 then "view"
        when t1.type = 3 then "external"
        else "?"
     END AS type,
  case 
    when t2.ddl like "%PARTITION BY%" then "Yes" 
    else "No" 
  end as partition,
  REGEXP_EXTRACT(t2.ddl, r".*partition_expiration_days=([0-9-].*)") as partition_expiration_days,
  REGEXP_EXTRACT(t2.ddl, r"CLUSTER BY(.*)") as cluster_key,
FROM `'"${project_name}"'.'"${dataset}"'.__TABLES__` as t1,`'"${project_name}"'.'"${dataset}"'.INFORMATION_SCHEMA.TABLES` as t2
where t1.table_id=t2.table_name' | sed "1d" >> /tmp/bq_out.csv
done

Upvotes: 0

Bruno

Reputation: 332

I reached a rate limit trying to use the accepted answer in BigQuery SQL for hundreds of datasets with thousands of tables:

Exceeded rate limits: too many concurrent dataset meta table reads per project for this project. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas

and had to use the API (Python 3.7):

requirements.txt:

google-cloud-bigquery==3.7.0

main.py:

from google.cloud import bigquery

# Replace with your project ID
project_id = "your-project-id"

# Create a BigQuery client object
client = bigquery.Client(project=project_id)

# List all datasets in the project
datasets = list(client.list_datasets())

# Iterate through each dataset and list its tables
for dataset in datasets:
    print("Dataset: {}".format(dataset.dataset_id))
    tables = list(client.list_tables(dataset))
    for table in tables:
        print("\t{}".format(table.table_id))

Upvotes: 0

Cristian Avendaño

Reputation: 477

Maybe you can use INFORMATION_SCHEMA instead of __TABLES__:

SELECT * FROM `region-us`.INFORMATION_SCHEMA.TABLES;

Just replace region-us with the region where your datasets are. If you have more than one region, you'll need to use UNION ALL, but it's still simpler than a UNION over every dataset.

Or you can use a query to generate all of the UNION ALL statements, like this:

WITH SelectTable AS (
  SELECT
    1 AS ID,
    'SELECT * FROM `' || table_schema || '.__TABLES__`' AS SelectColumn
  FROM `region-us`.INFORMATION_SCHEMA.TABLES
  GROUP BY table_schema
)
SELECT STRING_AGG(SelectColumn, ' UNION ALL\n') FROM SelectTable
GROUP BY ID
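If your datasets do span several regions, the per-region queries can also be stitched together client-side. A minimal Python sketch (the helper and region names are purely illustrative, not any official API):

```python
def build_region_union(regions):
    """Build one UNION ALL query over INFORMATION_SCHEMA.TABLES per region.

    `regions` is a list of BigQuery region names (illustrative); use
    whichever regions your datasets actually live in.
    """
    selects = [
        f"SELECT * FROM `region-{r}`.INFORMATION_SCHEMA.TABLES"
        for r in regions
    ]
    return " UNION ALL\n".join(selects)

# Example: two regions -> two SELECTs joined by one UNION ALL
print(build_region_union(["us", "eu"]))
```

The generated string can then be pasted into the console or run via EXECUTE IMMEDIATE.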

Upvotes: 1

Daniel Kearney

Reputation: 31

You can extend Mikhail Berlyant's answer and automatically generate the SQL with a single query.

The INFORMATION_SCHEMA.SCHEMATA view lists all the datasets. You can use a WHILE loop to generate all of the UNION ALL statements dynamically, like so:

DECLARE schemas ARRAY<string>;
DECLARE query string;
DECLARE i INT64 DEFAULT 0;
DECLARE arrSize INT64;

SET schemas = ARRAY(select schema_name from <your_project>.INFORMATION_SCHEMA.SCHEMATA);
SET query = "SELECT * FROM (";
SET arrSize = ARRAY_LENGTH(schemas);

WHILE i < arrSize - 1 DO
  SET query = CONCAT(query, "SELECT '", schemas[OFFSET(i)], "', table_ID, row_count, size_bytes from <your_project>.", schemas[OFFSET(i)], '.__TABLES__ UNION ALL ');
  SET i = i + 1;
END WHILE;

SET query = CONCAT(query, "SELECT '", schemas[ORDINAL(arrSize)], "', table_ID, row_count, size_bytes from <your_project>.", schemas[ORDINAL(arrSize)], '.__TABLES__ )');

EXECUTE IMMEDIATE query;

Upvotes: 3

Brent S

Reputation: 11

Building on @mikhail-berlyant's nice solution above, it's now possible to take advantage of BigQuery's scripting features to automate gathering the list of datasets and retrieving table metadata. Simply replace the *_name variables to generate a view of metadata for all of your tables in a given project.

DECLARE project_name STRING;
DECLARE dataset_name STRING;
DECLARE table_name STRING;
DECLARE view_name STRING;
DECLARE generate_metadata_query_for_all_datasets STRING;
DECLARE retrieve_table_metadata STRING;
DECLARE persist_table_metadata STRING;
DECLARE create_table_metadata_view STRING;

SET project_name = "your-project";
SET dataset_name = "your-dataset";
SET table_name = "your-table";
SET view_name = "your-view";

SET generate_metadata_query_for_all_datasets = CONCAT("SELECT STRING_AGG( CONCAT(\"select * from `",project_name,".\", schema_name, \".__TABLES__` \"), \"union all \\n\" ) AS datasets FROM `",project_name,"`.INFORMATION_SCHEMA.SCHEMATA");
SET
  retrieve_table_metadata = generate_metadata_query_for_all_datasets;
SET create_table_metadata_view = CONCAT(
"""
 CREATE VIEW IF NOT EXISTS
`""",project_name,".",dataset_name,".",view_name,"""`
AS
 SELECT
  project_id
  ,dataset_id
  ,table_id
  ,DATE(TIMESTAMP_MILLIS(creation_time)) AS created_date
  ,TIMESTAMP_MILLIS(creation_time) AS created_at
  ,DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
  ,TIMESTAMP_MILLIS(last_modified_time) AS last_modified_at
  ,row_count
  ,size_bytes
  ,round(safe_divide(size_bytes, (1000*1000)),1) as size_mb
  ,round(safe_divide(size_bytes, (1000*1000*1000)),2) as size_gb
  ,CASE
    WHEN type = 1 THEN 'native table'
    WHEN type = 2 THEN 'view'
    WHEN type = 3 THEN 'external table'
    ELSE 'unknown'
  END AS type
 FROM `""",project_name,".",dataset_name,".",table_name,"""`
 ORDER BY dataset_id, table_id asc""");
EXECUTE IMMEDIATE retrieve_table_metadata INTO persist_table_metadata;
EXECUTE IMMEDIATE CONCAT("CREATE OR REPLACE TABLE `",project_name,".",dataset_name,".",table_name,"` AS (",persist_table_metadata,")");
EXECUTE IMMEDIATE create_table_metadata_view;

After that you can query your new view.

    SELECT * FROM `[PROJECT ID].[DATASET ID].[VIEW NAME]`

Upvotes: 1

Tiago Peres

Reputation: 15431

Mikhail Berlyant's answer is very good. I would like to add that there's a cleaner way to go in some cases.

If you have only one dataset, the tables are within that same dataset, and they follow a naming pattern, you can query them using a wildcard table.

Let's say you want to query the noaa_gsod dataset (its tables are named gsod1929, gsod1930, ..., gsod2018, gsod2019); then simply use

FROM
  `bigquery-public-data.noaa_gsod.gsod*`

This is going to match all tables in the noaa_gsod dataset that begin with the string gsod.

Upvotes: -1

Dinh Tran

Reputation: 301

You can use this SQL query to generate the list of datasets in your project:

select  string_agg(
      concat("select * from `[PROJECT ID].", schema_name, ".__TABLES__` ")
    , "union all \n"
)
from `[PROJECT ID]`.INFORMATION_SCHEMA.SCHEMATA;

You will have this list:

select * from `[PROJECT ID].[DATASET ID 1].__TABLES__` union all 
select * from `[PROJECT ID].[DATASET ID 2].__TABLES__` union all 
select * from `[PROJECT ID].[DATASET ID 3].__TABLES__` union all 
select * from `[PROJECT ID].[DATASET ID 4].__TABLES__` 
...

Then put the list within this query:

SELECT 
    table_id
    ,DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date
    ,DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
    ,row_count
    ,size_bytes
    ,round(safe_divide(size_bytes, (1000*1000)),1) as size_mb
    ,round(safe_divide(size_bytes, (1000*1000*1000)),2) as size_gb
    ,CASE
        WHEN type = 1 THEN 'table'
        WHEN type = 2 THEN 'view'
        WHEN type = 3 THEN 'external'
        ELSE '?'
     END AS type
    ,TIMESTAMP_MILLIS(creation_time) AS creation_time
    ,TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time
    ,FORMAT_TIMESTAMP("%Y-%m", TIMESTAMP_MILLIS(last_modified_time)) as last_modified_month
    ,dataset_id
    ,project_id
FROM 
(   
    select * from `[PROJECT ID].[DATASET ID 1].__TABLES__` union all 
    select * from `[PROJECT ID].[DATASET ID 2].__TABLES__` union all 
    select * from `[PROJECT ID].[DATASET ID 3].__TABLES__` union all 
    select * from `[PROJECT ID].[DATASET ID 4].__TABLES__`
)
ORDER BY dataset_id, table_id asc 

Upvotes: 20

hkanjih

Reputation: 1301

I know you asked for a BigQuery solution, but I wrote a Python script to get the information you are asking for; maybe it can help other coders:

Pip install:

!pip install google-cloud
!pip install google-api-python-client
!pip install oauth2client

Code:

import threading

from google.cloud import bigquery

def _worker_query(project, dataset_id, results_scan ):
    query_str = 'SELECT * FROM `{}.{}.__TABLES__`'.format(project, dataset_id )
    QUERY = (query_str)
    query_job = client.query(QUERY)
    rows = query_job.result()
    count = 0
    for row in rows:
        count = count+1
    results_scan.append({'dataset_id':dataset_id, 'count':count})

def main_execute():


    project = 'bigquery-public-data'
    dataset = client.list_datasets(project)
    count = 0

    threads_project = []
    results_scan = []

    for d in dataset:
        t = threading.Thread(target=_worker_query, args=(project,d.dataset_id, results_scan))
        threads_project.append(t)
        t.start()

    for t in threads_project:
        t.join()

    total_count = 0
    for result in results_scan:
        print(result)
        total_count =  total_count + result['count']

    print('\n\nTOTAL TABLES: "{}"'.format(total_count))

JSON_FILE_NAME = 'sa_bq.json'
client = bigquery.Client.from_service_account_json(JSON_FILE_NAME)
main_execute()

Upvotes: 6

Mikhail Berlyant

Reputation: 172954

The __TABLES__ syntax is supported only for a specific dataset and does not work across datasets.

What you can do is something like below:

#standardSQL
WITH ALL__TABLES__ AS (
  SELECT * FROM `bigquery-public-data.1000_genomes.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.baseball.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.bls.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.census_bureau_usa.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.cloud_storage_geo_index.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.cms_codes.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.common_us.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.fec.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.genomics_cannabis.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.ghcn_d.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.ghcn_m.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.github_repos.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.hacker_news.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.irs_990.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.medicare.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.new_york.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.nlm_rxnorm.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.noaa_gsod.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.open_images.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.samples.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.san_francisco.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.stackoverflow.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.usa_names.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.utility_us.__TABLES__` 
)
SELECT *
FROM ALL__TABLES__

In this case you need to know the list of datasets in advance, which you can easily get via the Datasets: list API or the respective bq ls command.
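If maintaining that list by hand is tedious, the query text can also be generated client-side. A minimal Python sketch, assuming the dataset names were already fetched with bq ls or the API (the helper name is made up):

```python
def build_tables_union(project, datasets):
    """Build one UNION ALL query over each dataset's __TABLES__ view.

    `project` and `datasets` are assumed to come from `bq ls` or the
    Datasets: list API; the values below are just examples.
    """
    selects = [
        f"SELECT * FROM `{project}.{d}.__TABLES__`"
        for d in datasets
    ]
    return "#standardSQL\n" + " UNION ALL\n".join(selects)

print(build_tables_union("bigquery-public-data", ["baseball", "bls"]))
```

Paste the result into the console, or feed it to the CLI with bq query.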

Please note: the above approach will work only for datasets with data in the same location. If you have datasets with data in different locations, you will need to query them with separate queries.

For example:

#standardSQL
WITH ALL_EU__TABLES__ AS (
  SELECT * FROM `bigquery-public-data.common_eu.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.utility_eu.__TABLES__` 
)
SELECT *
FROM ALL_EU__TABLES__

Upvotes: 10
