Reputation: 1094
I have a use case where records are published from an on-premises system to a Pub/Sub topic. I want to make sure that every record published is read by the Apache Beam job and correctly written to BigQuery. I have two questions about this:

1. How do I make sure that there is no data loss in the entire process?
2. I need to maintain an audit table somewhere to verify that if 'n' records were published, each one of them was written successfully. How do I keep track of the records?

Thank you.
Upvotes: 1
Views: 244
Reputation: 7493
Google Cloud Dataflow guarantees exactly-once data processing, with transactional logic built into its sources and sinks. You can read more about the exactly-once guarantees in the blog post *After Lambda: Exactly-once processing in Cloud Dataflow, Part 3 (sources and sinks)*.
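For reference, a minimal streaming pipeline that reads from Pub/Sub and writes to BigQuery could look like the sketch below, using the Beam Python SDK. The project, topic, and table names are placeholders for your own resources, and it assumes each message is a JSON object whose fields match the existing table schema:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming mode is required for an unbounded Pub/Sub source.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     # Placeholder topic -- substitute your own project and topic.
     | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
           topic='projects/my-project/topics/my-topic')
     # Each Pub/Sub message payload is parsed into a dict for the sink.
     | 'ParseJson' >> beam.Map(json.loads)
     # Placeholder table; CREATE_NEVER assumes the table already exists.
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.my_table',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```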
For your question about an audit table: could you describe in more detail what you'd like to accomplish? Dataflow has built-in "Elements added" counters, available in the UI and the API, which show exactly how many elements each step has processed. You could reconcile that number with the number of published Pub/Sub messages, e.g. with the sketch below.
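If you want a count you control end to end, one option is a custom Beam metric: increment a counter in a pass-through `DoFn` placed just before the BigQuery sink, then compare its value with the number of messages published. A minimal sketch (the class and counter names here are just examples):

```python
import apache_beam as beam
from apache_beam.metrics import Metrics


class CountRecords(beam.DoFn):
    """Passes records through unchanged while incrementing a custom counter."""

    def __init__(self):
        # Shows up as a custom metric in the Dataflow monitoring UI
        # and can be read through the Dataflow metrics API.
        self.records = Metrics.counter(self.__class__, 'records_written')

    def process(self, element):
        self.records.inc()
        yield element


# Usage: insert the step just before the sink, e.g.
#   ... | 'CountRecords' >> beam.ParDo(CountRecords())
#       | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(...)
```

Custom counters appear alongside the built-in "Elements added" counters, so you can read both from the same place when reconciling against the publisher's count.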
Upvotes: 1