Reputation: 6792
I'm running some tests to avoid duplicates during insert. I noticed that rows[].insertId could prevent duplicates, but it seems the bq command line has no such parameter. I tried with --undefok, but with no effect.
bq --apilog= --show_build_data insert --insert_id=201603210850 --template_suffix=_20160520 --dataset_id=mydataset --undefok=insert_id MYTEMPLATE.table myjson.json
Am I missing something?
Upvotes: 0
Views: 845
Reputation: 207912
AFAIK the insert_id
is only taken into account for streaming inserts, not load jobs.
And it's not a command-line switch; it's a value set on each row being ingested.
https://cloud.google.com/bigquery/streaming-data-into-bigquery#before_you_begin
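Since bq doesn't expose it, here is a minimal sketch of passing the per-row insertId through the streaming API with the google-cloud-bigquery Python client; the project, dataset, table, and row contents are hypothetical placeholders. The row_ids argument is what populates rows[].insertId in the underlying tabledata.insertAll request.

# Minimal streaming-insert sketch (assumes google-cloud-bigquery is installed;
# "my-project.mydataset.mytable" is a hypothetical table name).
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

rows = [
    {"name": "alice", "value": 1},
    {"name": "bob", "value": 2},
]

# row_ids maps one-to-one onto rows[].insertId; BigQuery uses it for
# best-effort de-duplication of rows streamed within a short window.
errors = client.insert_rows_json(
    "my-project.mydataset.mytable",
    rows,
    row_ids=["201603210850-0", "201603210850-1"],
)
if errors:
    print("insert errors:", errors)

Because that de-duplication is only best-effort, the manual cleanup below is still worth doing.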
Manually removing duplicates
You can use the following manual process to ensure that no duplicate rows exist after you are done streaming.
1) Add the insertId as a column in your table schema and include the insertId value in the data for each row.
2) After streaming has stopped, run the following query to check for duplicates:
SELECT MAX(count) FROM (
  SELECT <id_column>, COUNT(*) AS count
  FROM <table>
  GROUP BY <id_column>
)
If the result is greater than 1, duplicates exist.
3) To remove duplicates, perform the following query. You should specify a destination table, allow large results, and disable result flattening.
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY <id_column>) row_number
  FROM <table>
)
WHERE row_number = 1
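If you drive that cleanup query from code rather than the web UI, here is a minimal sketch with the same Python client, again with hypothetical names; allow_large_results and flatten_results are legacy-SQL options, which matches the query above, and destination receives the de-duplicated rows.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Destination table, large results, no flattening -- the three settings the
# docs call out. These options only apply to legacy SQL.
job_config = bigquery.QueryJobConfig(
    use_legacy_sql=True,
    allow_large_results=True,
    flatten_results=False,
)
job_config.destination = bigquery.TableReference.from_string(
    "my-project.mydataset.mytable_dedup"  # hypothetical destination table
)

query = """
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id_column) row_number
  FROM [mydataset.mytable]
)
WHERE row_number = 1
"""

client.query(query, job_config=job_config).result()  # block until done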
Upvotes: 1