Ali SAID OMAR

Reputation: 6792

insertId equivalent for bq command line

I'm running some tests to avoid duplicates during inserts. I noticed that rows[].insertId could help avoid duplicates, but it seems the bq command line has no such parameter. I tried with --undefok, but it had no effect.

 bq --apilog= --show_build_data insert --insert_id=201603210850 --template_suffix=_20160520  --dataset_id=mydataset --undefok=insert_id MYTEMPLATE.table myjson.json

Am I missing something?

Upvotes: 0

Views: 845

Answers (1)

Pentium10

Reputation: 207912

AFAIK, insertId is only taken into account for streaming inserts, not load jobs.

And it's not a command-line switch; it's a value set on each row being ingested.

https://cloud.google.com/bigquery/streaming-data-into-bigquery#before_you_begin
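
Since insertId lives in the request body rather than in a flag, here is a minimal sketch of a streaming insert with deduplication via the REST API (the project, dataset, table, and row contents below are assumptions for illustration):

# Stream one row; insertId is a per-row field in the payload, not a switch.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://www.googleapis.com/bigquery/v2/projects/myproject/datasets/mydataset/tables/mytable/insertAll" \
  -d '{"rows": [{"insertId": "201603210850", "json": {"id": 1, "name": "example"}}]}'

BigQuery uses insertId for best-effort deduplication of retried rows, so a row streamed twice with the same insertId is typically inserted only once.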

Manually removing duplicates

You can use the following manual process to ensure that no duplicate rows exist after you are done streaming.

1) Add the insertId as a column in your table schema and include the insertId value in the data for each row.

2) After streaming has stopped, perform the following query to check for duplicates:

SELECT max(count) FROM(
  SELECT <id_column>, count(*) as count
  FROM <table>
  GROUP BY <id_column>)

If the result is greater than 1, duplicates exist.
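
To stay on the bq command line, the duplicate check can be run directly with bq query (a sketch; replace <id_column> and <table> with your own names):

bq query "SELECT max(count) FROM (SELECT <id_column>, count(*) as count FROM <table> GROUP BY <id_column>)"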

3) To remove duplicates, perform the following query. You should specify a destination table, allow large results, and disable result flattening.

SELECT *
FROM (
  SELECT
      *,
      ROW_NUMBER()
          OVER (PARTITION BY <id_column>)
          row_number
  FROM <table>
)
WHERE row_number = 1
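
For step 3, the destination table, large results, and disabled flattening map onto bq query flags like this (a sketch; the destination table name is an assumption, and adding --replace would overwrite an existing destination with WRITE_TRUNCATE semantics):

bq query \
  --destination_table=mydataset.mytable_deduped \
  --allow_large_results \
  --noflatten_results \
  "SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY <id_column>) row_number FROM <table>) WHERE row_number = 1"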

Notes about the duplicate removal query:

  • The safer strategy for the duplicate removal query is to target a new table. Alternatively, you can target the source table with write disposition WRITE_TRUNCATE.
  • The duplicate removal query adds a row_number column with the value 1 to the end of the table schema. You can select by specific column names to omit this column.
  • For querying live data with duplicates removed, you can also create a view over your table using the duplicate removal query; see the sketch after this list. Be aware that query costs against the view are calculated based on the columns selected in your view, which can result in large bytes-scanned sizes.
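
A deduplicating view, as described in the last note, can also be created from the command line (a sketch; the view name is an assumption):

bq mk \
  --view="SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY <id_column>) row_number FROM <table>) WHERE row_number = 1" \
  mydataset.mytable_deduped_view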

Upvotes: 1
