Programmer120

Reputation: 2592

What is the best practice for loading data into a BigQuery table?

Currently I'm loading data from Google Storage into stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, the same order can have more than one version; the field etl_timestamp tells which row is the most up-to-date one.

Then I WRITE_TRUNCATE my production_table_orders with a query like:

SELECT ...
FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY date_purchased, orderid
    ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1

Then production_table_orders always contains the most up-to-date version of each order.

This process is supposed to run every 3 minutes.

I'm wondering if this is the best practice. I have around 20M rows, and it seems unwise to WRITE_TRUNCATE 20M rows every 3 minutes.

Suggestions?

Upvotes: 2

Views: 629

Answers (1)

yan-hic

Reputation: 1554

We are doing the same. To help improve performance though, try partitioning the table by date_purchased and clustering it by orderid. Use a CTAS statement (to the table itself), as you cannot add partitioning after the fact.
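
For reference, a minimal sketch of such a CTAS, assuming date_purchased is a DATE column (if it is a TIMESTAMP, partition by DATE(date_purchased) instead):

-- Rebuild the table onto itself with partitioning and clustering;
-- CREATE OR REPLACE TABLE may select from the table it replaces.
CREATE OR REPLACE TABLE `warehouse.stage_table_orders`
PARTITION BY date_purchased
CLUSTER BY orderid
AS
SELECT * FROM `warehouse.stage_table_orders`;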

EDIT: use 2 tables and MERGE

Depending on your particular use case, i.e. the number of fields that could differ between the old and new versions, you could use two tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:

MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
   F.date_purchased = S.date_purchased
WHEN MATCHED THEN
  UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
  INSERT (field1, field2, ...) VALUES (S.field1, S.field2, ...)

Pro: efficient if only a few rows are "upserted" rather than millions (although not tested), and partition pruning should work.

Con: you have to explicitly list the fields in the UPDATE and INSERT clauses. A one-time effort if the schema is pretty much fixed.
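
One caveat worth noting, since stage_table_orders can hold several versions of the same order: BigQuery's MERGE errors out when more than one source row matches the same target row, so the source may need de-duplicating first. A sketch, reusing the ROW_NUMBER() approach from the question (field names are the same placeholders as above):

MERGE final_table_orders F
USING (
  SELECT * EXCEPT(rn)
  FROM (
    SELECT *, ROW_NUMBER() OVER (
      PARTITION BY date_purchased, orderid
      ORDER BY etl_timestamp DESC) AS rn
    FROM stage_table_orders
  )
  WHERE rn = 1  -- keep only the latest version of each order
) S
ON F.orderid = S.orderid
   AND F.date_purchased = S.date_purchased
WHEN MATCHED THEN
  UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
  INSERT (field1, field2, ...) VALUES (S.field1, S.field2, ...)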

There are many ways to de-duplicate and there is no one-size-fits-all. Search SO for similar questions using ARRAY_AGG, EXISTS with DELETE, or UNION ALL, ... Try them out and see which performs better for YOUR dataset.
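
For instance, a minimal sketch of the ARRAY_AGG variant against the staging table from the question, keeping the newest row per order:

SELECT latest.*
FROM (
  -- ARRAY_AGG collects whole rows as STRUCTs; ORDER BY ... LIMIT 1
  -- retains only the row with the highest etl_timestamp per order.
  SELECT ARRAY_AGG(t ORDER BY etl_timestamp DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM `warehouse.stage_table_orders` AS t
  GROUP BY date_purchased, orderid
)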

Upvotes: 3
