Reputation: 20324
The problem I'm trying to solve is removing duplicates from a particular partition as referenced by a TIMESTAMP
type column. My table is something like the schema below with the timestamp column partition having day-based granularity:
requestID:STRING, ts:TIMESTAMP, recordNo:INTEGER, recordData:STRING
Now I have millions and millions of these and sometimes there are duplicates like this:
'server1234', '2020-06-10', 1, apple
'server1234', '2020-06-10', 1, apple
'server1234', '2020-06-10', 2, orange
'server1234', '2020-06-10', 2, orange
The uniqueness of the records is determined by two fields: requestID
and recordNo
. I'd like to remove the duplicates in the partition where CAST(ts AS DATE) = '2020-06-10'
. I can see the distinct records with a simple select:
SELECT DISTINCT * FROM mytable WHERE CAST(ts AS DATE) = '2020-06-10'
There must be a way to combine a delete/update/merge with the select distinct so that I can replace the partition with the de-duplicated data.
Thoughts?
Upvotes: 1
Views: 191
Reputation: 111
The safest way to do this is to select only the data (de-duplicated) you need out into a new table, delete the data in your permanent table, then insert your de-duplicated data back into the permanent location. BigQuery does not make update/delete methods as easy as some OLTP databases.
If you would prefer a more one-shot approach, here is an example with the data you provided that does the trick.
-- SETUP
CREATE TABLE working.remove_dupes
(
requestID STRING,
ts TIMESTAMP,
recordNo INT64,
recordData STRING
)
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR);
INSERT INTO working.remove_dupes(requestID, ts, recordNo, recordData)
VALUES
('server1234', '2020-06-10', 1, 'apple'),
('server1234', '2020-06-10', 1, 'apple'),
('server1234', '2020-06-10', 2, 'orange'),
('server1234', '2020-06-10', 2, 'orange');
------------------------------------------------------------------------------------
-- SELECTING ONLY ONE OF THE ENTRIES (NO DUPLICATES)
SELECT
requestID,
ts,
recordNo,
recordData
FROM (
SELECT
requestID,
ts,
recordNo,
recordData,
ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
FROM
working.remove_dupes
)
WHERE
instance_num = 1;
------------------------------------------------------------------------------------
-- REPLACE THE ORIGINAL TABLE, REMOVING DUPLICATES IN THE PROCESS
-- BACK UP YOUR TABLE FIRST!!!!! (MAKE A COPY)
CREATE OR REPLACE TABLE working.remove_dupes
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR)
AS
(SELECT
requestID,
ts,
recordNo,
recordData
FROM (
SELECT
requestID,
ts,
recordNo,
recordData,
ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
FROM
working.remove_dupes
)
WHERE
instance_num = 1);
EDIT: Note that replacing the table can (in my experience) wipe out table metadata (descriptions) and possibly the table partition. I've updated the example to include a table partition setup.
Upvotes: 2