coelho

Reputation: 50

How to merge one thousand tables within a BigQuery dataset?

I have a dataset within BigQuery with roughly 1000 tables, one for each variable. Each table contains two columns: observation_number and variable_name, where the variable_name column is actually named after the variable the table holds. Each table contains at least 20000 rows. What is the best way to merge these tables on the observation number?

I have developed Python code, intended to run in a Cloud Function, that generates the SQL query to merge the tables. It connects to the dataset and loops through the tables to collect all of the table_ids. However, the resulting query ends up being very large and its performance is poor.

Here is a sample of the Python code that generates the query (note that it currently runs locally, not yet in a Cloud Function).

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set project_id and dataset_id.
project_id = 'project-id-gcp'
dataset_name = 'sample_dataset'
dataset_id = project_id+'.'+dataset_name

dataset = client.get_dataset(dataset_id)

# View tables in dataset
tables = list(client.list_tables(dataset))  # API request(s)
table_names = []

if tables:
    for table in tables:
        table_names.append(table.table_id)
else:
    print("\tThis dataset does not contain any tables.")

# Build the SELECT list, the chain of FULL OUTER JOINs, and the final clauses.
# The first table anchors the join; every other table is joined to it on observation.
query_start = f"select {table_names[0]}.observation, {table_names[0]}.{table_names[0]}"
query_select = ""
query_from_select = (f"(select observation, {table_names[0]} "
                     f"from `{dataset_name}.{table_names[0]}`) {table_names[0]}")
for table_name in table_names[1:]:
    query_select += f", {table_name}.{table_name}"
    query_from_select += (
        f" FULL OUTER JOIN (select observation, {table_name} "
        f"from `{dataset_name}.{table_name}`) {table_name} "
        f"on {table_names[0]}.observation = {table_name}.observation"
    )
query_from_select = f" from ({query_from_select})"
query_where = f" where {table_names[0]}.observation IS NOT NULL"
query_order_by = " order by observation"

query_full = query_start + query_select + query_from_select + query_where + query_order_by

with open("query.sql","w") as f:
    f.write(query_full)

And this is a sample of the generated query for two tables:

select 
  VARIABLE1.observation, 
  VARIABLE1.VARIABLE1, 
  VARIABLE2.VARIABLE2 
from 
  (
    (
      select 
        observation, 
        VARIABLE1 
      from 
        `sample_dataset.VARIABLE1`
    ) VARIABLE1
    FULL OUTER JOIN (
      select 
        observation, 
        VARIABLE2 
      from 
        `sample_dataset.VARIABLE2`
    ) VARIABLE2 on VARIABLE1.observation = VARIABLE2.observation
  ) 
where 
  VARIABLE1.observation IS NOT NULL 
order by 
  observation

As the number of tables grows, this query gets larger and larger. Any suggestions on how to improve the performance of this operation? Any other way to approach this problem?

Upvotes: 0

Views: 1206

Answers (1)

rtenha

Reputation: 3628

I don't know if there is a great technical answer to this question. It seems like you are trying to do a huge number of joins in a single query, and many-way joins are not where BigQuery's strengths lie.

While I outline a potential solution below, have you considered whether (and why) you really need a table with 1000+ columns? Not saying you haven't, but there might be alternate ways to solve your problem without creating such a complex table.

One possible solution is to subset your joins/tables into more manageable chunks. If you have 1000 tables for example, run your script against smaller subsets of your tables (2/5/10/etc) and write those results to intermediate tables. Then join your intermediate tables. This might take a few layers of intermediate tables depending on the size of your sub-tables. Basically, you want to minimize (or make reasonable) the number of joins in each query. Delete the intermediate tables after you are finished to help with unnecessary storage costs.
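The chunking idea can be sketched in the same Python style as the question. This is a minimal sketch, not a full solution: `build_merge_query` and `chunks` are hypothetical helper names, the join here references the source tables directly rather than through subselects (BigQuery only scans the columns the query touches anyway), and each chunk's query would be written to an intermediate table (for example via `bigquery.QueryJobConfig(destination=...)`) before the intermediate table names are fed back through the same helpers for the next layer.

```python
def build_merge_query(dataset_name, table_names):
    """Full-outer-join a small batch of single-variable tables on `observation`.

    The first table in the batch anchors the join, mirroring the question's
    generated query; each table's value column is named after the table itself.
    """
    base = table_names[0]
    select_cols = ", ".join(f"{t}.{t}" for t in table_names)
    joins = "".join(
        f" full outer join `{dataset_name}.{t}` {t}"
        f" on {base}.observation = {t}.observation"
        for t in table_names[1:]
    )
    return (f"select {base}.observation, {select_cols}"
            f" from `{dataset_name}.{base}` {base}{joins}")

def chunks(items, size):
    """Split the full table list into batches of at most `size` tables."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

With, say, `size=10`, 1000 tables become 100 intermediate tables after one layer, 10 after two, and one final table after three, so no single query ever contains more than 9 joins.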

Upvotes: 2
