How to merge one thousand tables within a BigQuery dataset?

Question

I have a dataset within BigQuery with roughly 1000 tables, one for each variable. Each table contains two columns: observation_number, variable_name. Please note that the variable_name column assumes the actual variable name. Each table contains at least 20000 rows. What is the best way to merge these tables on the observation number?

I have developed a Python code that is going to run on a Cloud Function and it generates the SQL query to merge the tables. It does that by connecting to the dataset and looping through the tables to get all of the table_ids. However, the query ends up being too large and the performance is not that great.

Here it is the sample of the Python code that generates the query (mind it's still running locally, not yet in a Cloud Function).

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set project_id and dataset_id.
project_id = 'project-id-gcp'
dataset_name = 'sample_dataset'
dataset_id = project_id+'.'+dataset_name

dataset = client.get_dataset(dataset_id)

# View tables in dataset
tables = list(client.list_tables(dataset))  # API request(s)
table_names = []

if tables:
    for table in tables:
        table_names.append(table.table_id)
else:
    print("	This dataset does not contain any tables.")

query_start = "select "+table_names[0]+".observation"+","+table_names[0]+"."+table_names[0]
query_select = ""
query_from_select = "(select observation,"+table_names[0]+" from `"+dataset_name+"."+table_names[0]+"`) "+table_names[0]
for table_name in table_names:
    if table_name != table_names[0]:
        query_select = query_select + "," + table_name+"."+table_name
        query_from_select = query_from_select + " FULL OUTER JOIN (select observation," + table_name + " from " + "`"+dataset_name+"."+table_name+"`) "+table_name+" on "+table_names[0]+".observation="+table_name+".observation"
query_from_select = " from ("+query_from_select + ")"
query_where = " where " + table_names[0] + ".observation IS NOT NULL"
query_order_by = " order by observation"

query_full = query_start+query_select+query_from_select+query_where+query_order_by

with open("query.sql","w") as f:
    f.write(query_full)

And this is a sample of the generated query for two tables:

select 
  VARIABLE1.observation, 
  VARIABLE1.VARIABLE1, 
  VARIABLE2.VARIABLE2 
from 
  (
    (
      select 
        observation, 
        VARIABLE1 
      from 
        `sample_dataset.VARIABLE1`
    ) VARIABLE1 FULL 
    OUTER JOIN (
      select 
        observation, 
        VARIABLE2 
      from 
        `sample_dataset.VARIABLE2`
    ) VARIABLE2 on VARIABLE1.observation = VARIABLE2.observation
  ) 
where 
  VARIABLE1.observation IS NOT NULL 
order by 
  observation

As the number of tables grows, this query gets larger and larger. Any suggestions on how to improve the performance of this operation? Any other way to approach this problem?

How to merge one thousand tables within a BigQuery dataset?

Answers (1)

Related Questions