Joshua G

Reputation: 249

bigquery : data versioning and incremental updates

Is there any documentation on how to deal with an append-only database such as BigQuery for real-time analytics use (e.g. 100 million to 1 billion mobile user profiles that are updated daily), while keeping downtime to a few seconds?

(1) There is a post (http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/) on how to implement incremental updates in Hive, and I am wondering if a similar solution exists in BigQuery.

If I need to attach a version column (e.g. version 1, 2, 3, etc.), is there a quick way to calculate the version?
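For example, I imagine something along these lines (just a sketch; user_id and updated_at are made-up column names, and I am assuming each appended record carries a load timestamp):

SELECT
  user_id,
  updated_at,
  -- number each user's records in arrival order, so the newest row gets the highest version
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at ASC) AS version
FROM mydataset.mytable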

(2) How can downtime be minimized to a few seconds? I like the solution mentioned in "Eliminating duplicate records in a BigQuery Table", and I think it partially solves my challenge.

If mydataset.mytable has 100 million to 1 billion unique records (e.g. mobile user profiles), and downtime of more than a few seconds is not tolerable, is there any way to eliminate the downtime, e.g. with a virtual view? During the "bq query --replace" and "bq cp", does the table mydataset.mytable remain accessible, or is it locked until the replace and copy are completed? Thank you.

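# Step 1: rewrite the table, keeping only the rows whose keys are not in the update table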
bq query --allow_large_results --replace --destination_table=mydataset.mytable \
'SELECT * FROM mydataset.mytable
 WHERE key NOT IN (SELECT key FROM mydataset.update)'

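# Step 2: append the updated rows to the rewritten table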
bq cp --append_table mydataset.update mydataset.mytable

Upvotes: 5

Views: 5508

Answers (1)

Sean Chen

Reputation: 671

The first part of your question actually sounds like this:

Is there any other approach for updating a row in Big Query apart from overwriting the table?

The approach described there is the new "best practice" we recommend for removing duplicates (as opposed to the one linked in your question), and it is easily adapted to allow updates (based on time rather than a version number). You can adapt the partition expression to select rows based on whatever criteria are appropriate for you.
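As a rough sketch (not the exact query from the linked answer; key and updated_at stand in for your own key and timestamp columns), the pattern keeps only the newest row per key:

SELECT *
FROM (
  SELECT
    *,
    -- rank each key's rows newest-first so the latest update wins
    ROW_NUMBER() OVER (PARTITION BY key ORDER BY updated_at DESC) AS rn
  FROM mydataset.mytable
)
WHERE rn = 1

You could materialize this result with --replace, or save it as a view; combined with appending mydataset.update to the base table (as in your bq cp command), readers of such a view would always see the latest record per key without waiting for a table rewrite. If you keep an explicit version column instead, ordering by that column in the OVER clause gives the same effect.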

In answer to your second question, the table is not locked at any time during the "bq query --replace". Queries started before the "replace" job completes will run against a snapshot of the table taken at the time they started. Queries started after the "replace" job completes will see the new data.

Upvotes: 3
