How to replace timestamp-partitioned table data in BigQuery?

Question

The problem I'm trying to solve is removing duplicates from a single partition of a table that is partitioned on a TIMESTAMP column. My table has the schema below and is partitioned on the ts column with daily granularity:

requestID:STRING, ts:TIMESTAMP, recordNo:INTEGER, recordData:STRING

I have millions of these rows, and sometimes there are duplicates like this:

'server1234', '2020-06-10', 1, 'apple'
'server1234', '2020-06-10', 1, 'apple'
'server1234', '2020-06-10', 2, 'orange'
'server1234', '2020-06-10', 2, 'orange'

The uniqueness of the records is determined by two fields: requestID and recordNo. I'd like to remove the duplicates in the partition where CAST(ts AS DATE) = '2020-06-10'. I can see the distinct records with a simple select:

SELECT DISTINCT * FROM mytable WHERE CAST(ts AS DATE) = '2020-06-10'
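
Note that SELECT DISTINCT * only collapses rows that are identical in every column. If uniqueness is defined by requestID and recordNo alone, a key-based variant might look like the sketch below. Keeping the row with the earliest ts is an assumption made for illustration; the question does not say which duplicate should survive.

```sql
-- Sketch: keep exactly one row per (requestID, recordNo) within the day.
-- ORDER BY ts is an assumed tie-break, not something stated in the question.
SELECT *
FROM mytable
WHERE CAST(ts AS DATE) = '2020-06-10'
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY requestID, recordNo
  ORDER BY ts
) = 1
```

For the sample rows above, where the duplicates match in every column, both queries should return the same two records.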

There must be a way to combine a delete/update/merge with the select distinct so that I can replace the partition with the de-duplicated data.

Thoughts?
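
One possible way to combine the two steps is a MERGE with a join condition that never matches, so that every existing row in the partition is deleted and every de-duplicated row is inserted in the same statement. This is a minimal sketch that reuses the table name and filter from the question; it has not been run against the real table.

```sql
-- Sketch: replace one day's partition with its de-duplicated contents.
MERGE mytable AS t
USING (
  -- Source: the de-duplicated rows for the day being rewritten.
  SELECT DISTINCT *
  FROM mytable
  WHERE CAST(ts AS DATE) = '2020-06-10'
) AS s
-- ON FALSE means no target row ever matches a source row.
ON FALSE
-- Existing rows in that day are unmatched by the source, so delete them.
WHEN NOT MATCHED BY SOURCE AND CAST(t.ts AS DATE) = '2020-06-10' THEN
  DELETE
-- Every source row is unmatched by the target, so insert it.
WHEN NOT MATCHED BY TARGET THEN
  INSERT ROW
```

The date condition on the DELETE branch is what limits the change to the one partition; without it, rows from every other day would be deleted as well. Because this is a single DML statement, the delete and the re-insert are applied together rather than as two separate steps.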
