Neil C. Obremski
Neil C. Obremski

Reputation: 20324

How to replace timestamp-partitioned table data in BigQuery?

The problem I'm trying to solve is removing duplicates from a particular partition as referenced by a TIMESTAMP type column. My table is something like the schema below with the timestamp column partition having day-based granularity:

requestID:STRING, ts:TIMESTAMP, recordNo:INTEGER, recordData:STRING

Now I have millions and millions of these and sometimes there are duplicates like this:

'server1234', '2020-06-10', 1, apple
'server1234', '2020-06-10', 1, apple
'server1234', '2020-06-10', 2, orange
'server1234', '2020-06-10', 2, orange

The uniqueness of the records is determined by two fields: requestID and recordNo. I'd like to remove the duplicates in the partition where CAST(ts AS DATE) = '2020-06-10'. I can see the distinct records with a simple select:

SELECT DISTINCT * FROM mytable WHERE CAST(ts AS DATE) = '2020-06-10'

There must be a way to combine a delete/update/merge with the select distinct so that I can replace the partition with the de-duplicated data.

Thoughts?

Upvotes: 1

Views: 191

Answers (1)

Woodrow K.
Woodrow K.

Reputation: 111

The safest way to do this is to select only the data (de-duplicated) you need out into a new table, delete the data in your permanent table, then insert your de-duplicated data back into the permanent location. BigQuery does not make update/delete methods as easy as some OLTP databases.

If you would prefer a more one-shot approach, here is an example with the data you provided that does the trick.

-- SETUP
CREATE TABLE working.remove_dupes
(
  requestID STRING,
  ts TIMESTAMP,
  recordNo INT64,
  recordData STRING
)
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR);

INSERT INTO working.remove_dupes(requestID, ts, recordNo, recordData)
VALUES
('server1234', '2020-06-10', 1, 'apple'),
('server1234', '2020-06-10', 1, 'apple'),
('server1234', '2020-06-10', 2, 'orange'),
('server1234', '2020-06-10', 2, 'orange');

------------------------------------------------------------------------------------
-- SELECTING ONLY ONE OF THE ENTRIES (NO DUPLICATES)
SELECT
  requestID,
  ts,
  recordNo,
  recordData
FROM (
  SELECT
    requestID,
    ts,
    recordNo,
    recordData,
    ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
  FROM
    working.remove_dupes
)
WHERE
  instance_num = 1;


------------------------------------------------------------------------------------
-- REPLACE THE ORIGINAL TABLE, REMOVING DUPLICATES IN THE PROCESS
-- BACK UP YOUR TABLE FIRST!!!!! (MAKE A COPY)
CREATE OR REPLACE TABLE working.remove_dupes
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR)
AS
(SELECT
  requestID,
  ts,
  recordNo,
  recordData
FROM (
  SELECT
    requestID,
    ts,
    recordNo,
    recordData,
    ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
  FROM
    working.remove_dupes
)
WHERE
  instance_num = 1);

EDIT: Note that replacing the table can (in my experience) wipe out table metadata (descriptions) and possibly the table partition. I've updated the example to include a table partition setup.

Upvotes: 2

Related Questions