Ants Aus

Reputation: 45

How to avoid inserting duplicate rows when primary key (ID) is randomly generated in BigQuery

I have a table with a random auto-generated id (primary key). I am trying to avoid the insertion of duplicate rows.

Example of a duplicate row:

id | field a | field b | field c
 1 |       4 |       6 |       7
 2 |       4 |       6 |       7

The key (id) is not duplicated, since it is generated with a UUID, but all the other fields are identical.

I guess I'm looking for something like this, but in BigQuery's dialect: Avoiding inserting duplicate rows in MySQL

Upvotes: 1

Views: 2392

Answers (3)

Michael Entin

Reputation: 7744

You can use an insert into ... where not exists ... query, and that is fine if you do it rarely. But it is something of an anti-pattern if you do it often.

This query needs to scan the table the row is inserted into, so it can get slow and expensive as the table grows. Partitioning and clustering can help, but if you insert many rows one at a time, it can still be costly.
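As a sketch of the clustering idea (dataset, table, and column names are assumed from the question, not from the original answer), clustering the target table on the compared fields lets the not exists probe prune blocks instead of scanning the whole table:

```sql
-- Hypothetical DDL: cluster the target table on the fields used in
-- the duplicate check, so the NOT EXISTS probe reads fewer blocks.
CREATE TABLE mydataset.t (
  id STRING,
  a  INT64,
  b  INT64,
  c  INT64
)
CLUSTER BY a, b, c;
```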

A more common approach is to insert everything, and periodically deduplicate.
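The periodic deduplication step can be sketched like this (table and column names assumed from the question): keep one row per (a, b, c) combination and rewrite the table in place.

```sql
-- De-duplication sketch: number the rows within each (a, b, c) group
-- and keep only the first, discarding the helper column afterwards.
CREATE OR REPLACE TABLE mydataset.t AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY a, b, c ORDER BY id) AS rn
  FROM mydataset.t
)
WHERE rn = 1;
```

Run on a schedule, this keeps inserts cheap while bounding how long duplicates can linger.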

Upvotes: 2

Gordon Linoff

Reputation: 1270091

You can use not exists if you want to avoid inserting duplicate rows into the table:

insert into t (id, a, b, c)
    select i.*
    from (select 2 as id, 4 as a, 6 as b, 7 as c) i
    where not exists (select 1
                      from t
                      where t.a = i.a and t.b = i.b and t.c = i.c
                     );

Upvotes: 1

A___

Reputation: 81

To help protect your table against duplication when streaming data in, set the insertId property on each row in your request.

BigQuery uses the insertId property to de-duplicate streaming inserts on a best-effort basis.

new BigQueryInsertRow(insertId: "row1") {
    // { "column", value } pairs ...
},
new BigQueryInsertRow(insertId: "row2") {
    // { "column", value } pairs ...
}

Upvotes: -1
