Reject data load attempt to BigQuery for existing data

Question

I'm loading data from pandas dataframes to BigQuery using the pandas-gbq package:

df.to_gbq('dataset.table', project_id, reauth=False, if_exists='append')

A typical dataframe looks like:

key      |    value    |    order
"sd3e"   |     0.3     |    1
"sd3e"   |     0.2     |    2
"sd4r"   |     0.1     |    1
"sd4r"   |     0.5     |    2

Is there a way to reject the loading attempt if the key already appears in the BigQuery table?

Tamir Klein · Accepted Answer

Is there a way to reject the loading attempt if the key already appears in the BigQuery table?

No. BigQuery does not enforce unique keys the way other databases do, so it cannot reject a load on that basis. There are two typical ways to work around this:

Option 1:
Upload the data with a timestamp and use a MERGE statement to remove duplicates.

See this link on how to do this. This is an example:

MERGE `DATA` AS target
USING `DATA` AS source
ON target.key = source.key
WHEN MATCHED AND target.ts < source.ts THEN 
DELETE

Note: In this case you pay for the data the MERGE scans, but the rows in your table stay unique.
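The "upload the data with a timestamp" step is not shown above, so here is a minimal sketch of it. It assumes the column is called ts (the name the MERGE compares on) and that the destination table already has, or is created with, that column. stamp_rows and upload_with_timestamp are hypothetical helper names; the to_gbq call is copied from the question.

```python
from datetime import datetime, timezone


def stamp_rows(df, ts_col="ts"):
    """Return a copy of df with one UTC load timestamp on every row."""
    # Assumption: a single timestamp per upload is enough to tell loads apart.
    loaded_at = datetime.now(timezone.utc)
    return df.assign(**{ts_col: loaded_at})


def upload_with_timestamp(df, table, project_id):
    """Append the stamped rows using the same to_gbq call as in the question."""
    stamp_rows(df).to_gbq(table, project_id, reauth=False, if_exists="append")
```

It would be called as upload_with_timestamp(df, 'dataset.table', project_id) in place of the original df.to_gbq(...) line. The original dataframe is left unchanged because assign returns a new one.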

Option 2:

Upload the data with a timestamp and use the ROW_NUMBER window function to fetch the latest record. This is an example with your data:

WITH DATA AS (
    SELECT 'sd3e' AS key, 0.3 as value,  1 as r_order, '2019-04-14 00:00:00' as ts  UNION ALL
    SELECT 'sd3e' AS key, 0.2 as value,  2 as r_order, '2019-04-14 01:00:00' as ts  UNION ALL
    SELECT 'sd4r' AS key, 0.1 as value,  1 as r_order, '2019-04-14 00:00:00' as ts  UNION ALL
    SELECT 'sd4r' AS key, 0.5 as value,  2 as r_order, '2019-04-14 01:00:00' as ts  
)

SELECT * 
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) AS rn
    FROM `DATA` 
)
WHERE rn = 1

This returns only the latest row for each key (here, the two r_order = 2 rows with ts '2019-04-14 01:00:00').

Note: This case doesn't incur extra charges; however, you always have to make sure to use the window function when fetching from the table.
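Since every read has to go through the window function, one way to avoid forgetting it is to wrap the query in a small helper on the Python side. This is only a sketch: latest_rows_query and read_latest are hypothetical names, and the key and ts column names are assumed from the example above. pandas_gbq.read_gbq is imported lazily so the query builder can be used without it.

```python
def latest_rows_query(table, key_col="key", ts_col="ts"):
    """Build the ROW_NUMBER query from the answer for an arbitrary table."""
    return (
        "SELECT *\n"
        "FROM (\n"
        f"    SELECT *, ROW_NUMBER() OVER (PARTITION BY {key_col} "
        f"ORDER BY {ts_col} DESC) AS rn\n"
        f"    FROM `{table}`\n"
        ")\n"
        "WHERE rn = 1"
    )


def read_latest(table, project_id):
    """Fetch only the newest row per key into a dataframe (needs credentials)."""
    import pandas_gbq  # imported here so building the query has no dependencies

    query = latest_rows_query(table)
    return pandas_gbq.read_gbq(query, project_id=project_id, dialect="standard")
```

For example, read_latest('dataset.table', project_id) would run the same query as above against the real table instead of the inline DATA sample. The table and column names are interpolated directly into the SQL, so they should come from trusted code, not from user input.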
