hamdog

Reputation: 1181

Google BigQuery: splitting an ingestion-time partitioned table

I have an ingestion time partitioned table that's getting a little large. I wanted to group by the values in one of the columns and use that to split it into multiple tables. Is there an easy way to do that while retaining the original _PARTITIONTIME values in the set of new ingestion time partitioned tables?

Also I'm hoping for something that's relatively simple/cheap. I could do something like copy my table a bunch of times and then delete the data for all but one value on each copy, but I'd get charged a huge amount for all those DELETE operations.

Also I have enough unique values in the column I want to split on that saving a "WHERE column = value" query result to a table for every value would be cost prohibitive. I'm not finding any documentation that mentions whether this approach would even preserve the partitions, so even if it weren't cost prohibitive it may not work.

Upvotes: 0

Views: 1036

Answers (2)

Jake Biesinger

Reputation: 5828

BigQuery now supports clustered partitioned tables, which allow you to specify additional columns that the data should be split by.
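
For example, a clustered, ingestion-time partitioned table can be created with the google-cloud-bigquery Python client roughly like this (a minimal sketch; the table name and the "category" column are placeholders for your own, and clustering_fields needs a reasonably recent client version):

from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")

schema = [
    bigquery.SchemaField("category", "STRING"),  # placeholder for the column you want to split by
    bigquery.SchemaField("payload", "STRING"),
]

table = bigquery.Table("PROJECT_NAME.DATASET.clustered_table", schema=schema)
table.time_partitioning = bigquery.TimePartitioning()  # ingestion-time partitioning (_PARTITIONTIME)
table.clustering_fields = ["category"]  # cluster rows within each partition by this column
table = client.create_table(table)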

Upvotes: 0

Alexey Maloletkin

Reputation: 1099

The case you describe would require two-level partitioning, which is not supported yet.

You can create a column-partitioned table (https://cloud.google.com/bigquery/docs/creating-column-partitions) and then populate that partitioning column as needed before insert, instead of relying on ingestion time - but in this case you lose the _PARTITIONTIME value.
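
A minimal sketch of creating such a column-partitioned table with the Python client, assuming a placeholder TIMESTAMP column event_ts that you would fill from the old _PARTITIONTIME before insert:

from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),  # placeholder column, populated from the old _PARTITIONTIME
    bigquery.SchemaField("payload", "STRING"),
]

table = bigquery.Table("PROJECT_NAME.DATASET.column_partitioned_table", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="event_ts")  # partition by the column value
table = client.create_table(table)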

Based on the additional clarification: I had a similar problem, and my solution was to write a Python application that reads the source table (read is important here, not query, so it is free), splits the data based on your criteria, and then either streams the rows to the target tables (simple, but not free) or generates JSON/CSV files and loads them into the target tables (also free, but with some limits on the number of these operations). The second route will require more coding and exception handling.

You can also do it via Dataflow; it will definitely be more expensive than a custom solution, but potentially more robust.

Example using the google-cloud-bigquery Python library:

from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")

# read the source table directly (a table read, not a query, so it is free)
source_table_ref = client.dataset("SOURCE_DATASET").table("SOURCE_TABLE")
t1 = client.get_table(source_table_ref)
target_schema = t1.schema[1:]  # removing the first column, which is the key to split on
ds_target = client.dataset("TARGET_DATASET", project="TARGET_PROJECT")

# page through the source rows
start_index, max_results = 0, 10000
rows_to_process_iter = client.list_rows(t1, start_index=start_index, max_results=max_results)
# convert to list
rows_to_process = list(rows_to_process_iter)
# ... split rows_to_process by the key column into records_to_stream per target table ...

# stream records to the destination table
# (create_rows in older versions of the client library; insert_rows in newer ones)
target_table = client.get_table(ds_target.table("TARGET_TABLE"))
errors = client.insert_rows(target_table, records_to_stream)
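
For the second route, a minimal sketch of loading a generated newline-delimited JSON file into a specific ingestion-time partition via the $YYYYMMDD partition decorator (the file name, table names, and partition date are placeholders); the decorator is what preserves the original _PARTITIONTIME in the target table:

from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

# load into the partition matching the row's original _PARTITIONTIME
target_table_ref = client.dataset("TARGET_DATASET").table("TARGET_TABLE$20180101")

with open("split_part.json", "rb") as f:
    load_job = client.load_table_from_file(f, target_table_ref, job_config=job_config)
load_job.result()  # wait for the load job to complete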

Upvotes: 1
