Reputation: 2592
Currently I'm loading data from Google Storage into stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, the same order can appear in more than one version; the etl_timestamp field tells which row is the most up to date.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
SELECT ...
FROM (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1
Then production_table_orders always contains the latest version of each order. This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows. It doesn't seem smart to WRITE_TRUNCATE 20M rows every 3 minutes. Suggestions?
Upvotes: 2
Views: 629
Reputation: 1554
We are doing the same. To help improve performance, though, try partitioning the table by date_purchased and clustering by orderid.
Use a CTAS statement (writing to the table itself), as you cannot add partitioning to an existing table after the fact.
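For reference, a one-time CTAS could look like the sketch below. It assumes date_purchased is a DATE column (if it is a TIMESTAMP, partition on DATE(date_purchased) instead):

CREATE OR REPLACE TABLE `warehouse.production_table_orders`
-- repartition and recluster the existing data in place
PARTITION BY date_purchased
CLUSTER BY orderid
AS
SELECT * FROM `warehouse.production_table_orders`;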
EDIT: use 2 tables and MERGE
Depending on your particular use case, i.e. the number of fields that could change between the old and the new version, you could use 2 tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
   F.date_purchased = S.date_purchased
WHEN MATCHED THEN
  UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
  INSERT (field1, field2, ...) VALUES (S.field1, S.field2, ...)
Pro: efficient if only a few rows are "upserted", not millions (although not tested), and partition pruning should work.
Con: you have to explicitly list the fields in the UPDATE and INSERT clauses; a one-time effort if the schema is pretty much fixed.
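One caveat, not tested against your data: a MERGE fails when a target row matches more than one source row, and a WRITE_APPEND-ed staging table can hold several versions of the same order. A sketch that de-duplicates inside the USING clause first, reusing the ROW_NUMBER() approach from the question:

MERGE `warehouse.final_table_orders` F
USING (
  -- keep only the latest version of each order before merging
  SELECT * EXCEPT(rn)
  FROM (
    SELECT *, ROW_NUMBER() OVER
      (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
    FROM `warehouse.stage_table_orders`
  )
  WHERE rn = 1
) S
ON F.orderid = S.orderid AND
   F.date_purchased = S.date_purchased
WHEN MATCHED THEN
  UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
  INSERT (field1, field2, ...) VALUES (S.field1, S.field2, ...)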
There are many ways to de-duplicate and there is no one-size-fits-all. Search SO for similar questions using ARRAY_AGG, or EXISTS with DELETE, or UNION ALL, ... Try them out and see which performs better for YOUR dataset.
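For instance, an ARRAY_AGG-based de-duplication (a sketch using the column names from the question) keeps only the row with the latest etl_timestamp per order:

SELECT latest.*
FROM (
  -- aggregate all versions of each order, keep the newest one
  SELECT ARRAY_AGG(t ORDER BY t.etl_timestamp DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM `warehouse.stage_table_orders` AS t
  GROUP BY t.date_purchased, t.orderid
)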
Upvotes: 3