rish0097

Reputation: 1094

Audit records while working with streaming data in Apache Beam

I have a use case wherein records will be published from an on-premise system to a Pub/Sub topic. I want to make sure that all published records are read by the Apache Beam job and are correctly written to BigQuery. I have two questions regarding this:

1) How do I make sure that there is no data loss in the entire process?

2) I need to maintain an audit table somewhere to verify that if 'n' records were published, each one of them was written successfully. How do I keep track of the records?

Thank You.

Upvotes: 1

Views: 244

Answers (1)

Scott Wegner

Reputation: 7493

Google Cloud Dataflow guarantees exactly-once data processing, with transactional logic built into its sources and sinks. You can read more about exactly-once guarantees in the blog article: After Lambda: Exactly-once processing in Cloud Dataflow, Part 3 (sources and sinks).

For your question about an audit table: can you describe more about what you'd like to accomplish? Dataflow has built-in "Elements Added" counters, available in the UI and API, which show exactly how many elements have been processed at each step. You could match this against the number of published Pub/Sub messages.
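In case it helps, here's a minimal sketch (Beam Python SDK) of attaching a custom counter to the pipeline so the number of records read can be compared against what was published. The project, topic, dataset, and table names are placeholders (not from your question), and the target BigQuery table is assumed to already exist.

    # Minimal sketch: count every record read from Pub/Sub with a custom metric
    # before writing it to BigQuery. All resource names below are placeholders.
    import json

    import apache_beam as beam
    from apache_beam.metrics import Metrics
    from apache_beam.options.pipeline_options import PipelineOptions


    class CountAndPassThrough(beam.DoFn):
        """Increments a counter for every element seen, then emits it unchanged."""

        def __init__(self):
            super().__init__()
            self.records_seen = Metrics.counter(self.__class__, 'records_seen')

        def process(self, element):
            self.records_seen.inc()
            yield element


    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
               topic='projects/my-project/topics/my-topic')
         | 'CountRecords' >> beam.ParDo(CountAndPassThrough())
         | 'ParseJson' >> beam.Map(lambda msg: json.loads(msg))
         | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
               'my-project:my_dataset.my_table',
               # Assumes the table already exists with a matching schema.
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

The records_seen counter appears alongside the built-in counters in the Dataflow monitoring UI and can also be queried through the metrics API, so you can reconcile it against the Pub/Sub publish count.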

Upvotes: 1
