jimmy

Reputation: 4881

Delta job on BigQuery misses records

I'm facing an odd issue with a small delta job that I implemented on top of a streaming BigQuery table with Apache Beam.

I'm streaming data to a BigQuery table, and every hour I run a job to copy any new records from that streaming table to a reconciled table. The delta is built on top of a CreateDatetime column I introduced on the streaming table. Once a record gets loaded to the streaming table, it gets the current UTC timestamp. So the delta naturally takes all records that have a newer CreateDatetime than last time, up to the current time the batch is running:

CreateDatetime >= LastDeltaDate AND
CreateDatetime < NowUTC

The logic for LastDeltaDate is as follows:

1. Start: LastDeltaDate = 2017-01-01 00:00:00
2. 1st Delta Run: 
- NowUTC = 2017-10-01 06:00:00
- LastDeltaDate = 2017-01-01 00:00:00
- at the end of the successful run LastDeltaDate = NowUTC
3. 2nd Delta Run:
- NowUTC = 2017-10-01 07:00:00
- LastDeltaDate = 2017-10-01 06:00:00
- at the end of the successful run LastDeltaDate = NowUTC
...
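The watermark logic above can be sketched as follows. This is a minimal illustration, not the actual job: the `state` dict stands in for wherever `LastDeltaDate` is really persisted (e.g. a metadata table), and `delta_window`/`commit_run` are hypothetical helper names.

```python
from datetime import datetime, timezone

# Hypothetical watermark store; in the real job this would be persisted
# somewhere durable, not kept in memory.
state = {"last_delta_date": datetime(2017, 1, 1, tzinfo=timezone.utc)}

def delta_window(now_utc):
    """Return the [lower, upper) bounds used by the delta query."""
    return state["last_delta_date"], now_utc

def commit_run(now_utc, success):
    """Advance the watermark only after a successful run."""
    if success:
        state["last_delta_date"] = now_utc

# 1st delta run: window is 2017-01-01 00:00 .. 2017-10-01 06:00
now = datetime(2017, 10, 1, 6, tzinfo=timezone.utc)
lower, upper = delta_window(now)
commit_run(now, success=True)

# 2nd delta run picks up exactly where the first ended:
# window is 2017-10-01 06:00 .. 2017-10-01 07:00
now = datetime(2017, 10, 1, 7, tzinfo=timezone.utc)
lower, upper = delta_window(now)
```

The key property is that each window's lower bound equals the previous window's upper bound, so the windows tile the timeline with no gaps and no overlaps, as long as every record's timestamp is visible to the query before its window closes.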

Now every other day I find records that are on my streaming table but never arrived on my reconciled table. When I check the timestamps, they're far away from the batch run. The Google Dataflow log shows that no records were returned for the query at that time, but when I run the same query now, I get the records. Is there any way a streamed record could arrive super late in a query, or is it possible that Apache Beam is processing a record but not writing it for a long time? I'm not applying any windowing strategy.

Any ideas?

Upvotes: 0

Views: 518

Answers (1)

Ben Chambers

Reputation: 6130

When performing streaming inserts, there is a delay before those rows become available for batch exports and queries, as described in the BigQuery documentation on streaming data availability.

So, at time T2 you may have streamed a bunch of rows into BigQuery that are still sitting in the streaming buffer. You then run a batch job covering time T1 to T2, but only see the rows up to T2 minus the buffer delay. As a result, any rows that are still in the buffer during each delta run are dropped, because the next run's window starts at T2 and never revisits them.

You may need to make your selection of NowUTC aware of the streaming buffer, so that the next run processes the rows that were within the buffer.
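One simple way to do that is to hold the window's upper bound back by a fixed safety margin, so the window never reaches into the region where rows may still be buffered. A minimal sketch; `BUFFER_MARGIN` and `safe_now_utc` are assumed names, and the 90-minute value is an assumption you'd tune to comfortably exceed the buffer delay you actually observe:

```python
from datetime import datetime, timedelta, timezone

# Assumed safety margin: pick a value comfortably larger than the time
# rows can sit in BigQuery's streaming buffer in your setup.
BUFFER_MARGIN = timedelta(minutes=90)

def safe_now_utc(now=None):
    """Upper bound for the delta window, held back behind the buffer.

    Using this value as NowUTC means buffered rows stay ahead of the
    window and are picked up by a later run instead of being skipped.
    """
    now = now or datetime.now(timezone.utc)
    return now - BUFFER_MARGIN
```

Because each run's lower bound is the previous run's upper bound, the held-back bound only delays when rows are reconciled; it doesn't create gaps.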

Upvotes: 1

Related Questions