Reputation: 23
I'm building a Change Data Capture pipeline that reads data from a MYSQL database and creates a replica in BigQuery. I'll be pushing the changes in Pub/Sub and using Dataflow to transfer them to Google Cloud Storage. I have been able to figure out how to stream the changes, but I need to run batch processing for a few tables in my Database.
Can Dataflow be used to run a batch job while reading from an unbounded source like Pub/Sub? Can I run this batch job to transfer data from Pub/Sub to Cloud Storage and then load this data to BigQuery? I want a batch job because a stream job costs more.
Upvotes: 2
Views: 6153
Reputation: 75715
Thank you for the precision.
First, when you use PubSub in Dataflow (Beam framework), it's only possible in streaming mode
Cloud Pub/Sub sources and sinks are currently supported only in streaming pipelines, during remote execution.
If your process don't need realtime, you can skip Dataflow and save money. You can use Cloud Functions or Cloud Run for the process that I propose you (App Engine also if you want, but not my first recommendation).
In both cases, create a process (Cloud Run or Cloud Function) that is triggered periodically (every week?) by Cloud Scheduler.
Solution 1
Solution 2
However, this solution can fail if your app + the data size is larger than the ma memory value. An alternative, is to create a file into GCS every, for example, each 1 million of rows (depends the size and the memory footprint of each row). Name the file with a unique prefix, for example the date of the day (YYYYMMDD-tempFileXX), and increment the XX at each file creation. Then, create a load job, not from data in memory, but with data in GCS with a wild card in the file name (gs://myBucket/YYYYMMDD-tempFile*
). Like this all the files which match with the prefix will be loaded.
Recommendation The PubSub messages are kept up to 7 days into a pubsub subscription. I recommend you to trigger your process at least every 3 days for having time to react and debug before message deletion into the subscription.
Personal experience The stream write into BigQuery is cheap for a low volume of data. For some cents, I recommend you to consider the 1st solution is you can pay for this. The management and the code are smaller/easier!
Upvotes: 9