Reputation: 2929
We have some data generated from our devices installed on clients' side. Duplicated data exist and it is by design, which means we wouldn't be able to eliminate duplicated ones in data generating phase. We are now looking into the possibility to avoid duplication while streaming into Bigquery (rather than clean the data by doing table copy and delete later). That's to say, for every ready-to-be-streamed record, we check whether it's already in Bigquery first, if not then we continue to stream it in, if it does exist, then we won't stream it in.
But here's the concern: (quote from [here]:https://developers.google.com/bigquery/streaming-data-into-bigquery)
Data availability
The first time a streaming insert occurs, the streamed data is inaccessible for a warm-up period of up to two minutes. After the warm-up period, all streamed data added during and after the warm-up period is immediately queryable. After several hours of inactivity, the warm-up period will occur again during the next insert.
Data can take up to 90 minutes to become available for copy and export operations.
Our data will go into different bigquery tables (the table name is dynamically generated from the data's date_time). What does "the first time a stream insert occur" mean? Is it per table?
Does the above doc mean that we cannot rely on the query result to check for duplications in the process of streaming?
Upvotes: 3
Views: 1947
Reputation: 26637
If you provide an insert id, bigquery will automatically do the deduplication for you, as long as the duplicates are within the de-duplication window. The official docs don't mention how long the de-duplicatin window is, but it is generally from 5 minutes to 90 minutes (if you write data very quickly to a table, it will be closer to 5 than 90, but if data is trickled in, it will last longer in the deduplication buffers.).
Regarding "the first time a streaming insert occurs", this is per table. If you have a new table and start streaming to it, it may take a few minutes for that data to be available for querying. Once you've started streaming, however, new data will be available immediately.
Upvotes: 3