Reputation: 74
The Dataflow FAQ lists running custom (cron) job processes on Compute Engine as a way to schedule Dataflow pipelines. I am confused about how exactly that should be done: how do I start the Dataflow job on Compute Engine, and how do I set up the cron job?
Thank you!
Upvotes: 1
Views: 1309
Reputation: 403
You can use Google Cloud Scheduler to execute your Dataflow job. Cloud Scheduler jobs have targets, which can be HTTP/S endpoints, Pub/Sub topics, or App Engine applications, so you can use your Dataflow template as the target. See this external article for an example: Schedule Your Dataflow Batch Jobs With Cloud Scheduler, or, if you want to add more services to the interaction: Scheduling Dataflow Pipeline using Cloud Run, PubSub and Cloud Scheduler.
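As a sketch of what the Cloud Scheduler HTTP target would call: Scheduler can POST (with OAuth credentials) to the Dataflow REST API's `templates:launch` method. The project, region, bucket, and job names below are placeholder assumptions; the Word_Count template path is one of Google's public sample templates.

```python
import json

def launch_url(project, region):
    """URL of the Dataflow templates:launch REST method that a
    Cloud Scheduler HTTP target can POST to (with OAuth auth).
    gcsPath points at the template to run -- here a public sample."""
    return (f"https://dataflow.googleapis.com/v1b3/projects/{project}"
            f"/locations/{region}/templates:launch"
            f"?gcsPath=gs://dataflow-templates/latest/Word_Count")

def launch_body(job_name, parameters):
    """JSON body Scheduler sends with the POST: the job name plus
    the template's runtime parameters."""
    return {"jobName": job_name, "parameters": parameters}

# Example request (placeholder project/bucket values):
url = launch_url("my-project", "us-central1")
body = launch_body("scheduled-wordcount",
                   {"inputFile": "gs://my-bucket/input.txt",
                    "output": "gs://my-bucket/output"})
print(url)
print(json.dumps(body))
```

You would paste the URL and body into the Scheduler job's HTTP target configuration and give it a service account with Dataflow permissions.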
Upvotes: 1
Reputation: 5276
I have this working on App Engine, but I imagine it's similar for Compute Engine.
Cron will hit an endpoint on your service at the frequency you specify. So you need to set up a request handler for that endpoint that launches the Dataflow job when hit (essentially, in your request handler you define your pipeline and then call 'run' on it).
That should be the basics of it. An extra step I take: my cron job's request handler launches a Cloud Task, and the Cloud Task's request handler launches the Dataflow job. I do this because I've noticed the 'run' command for pipelines sometimes takes a while, and Cloud Tasks have a 10-minute timeout, compared to the 30s timeout for cron requests (or was it 60s?).
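The two-hop setup above can be sketched roughly as follows. The endpoint paths and queue wiring are illustrative assumptions, and the actual `create_task` / `pipeline.run()` calls (which need the `google-cloud-tasks` and `apache-beam` libraries plus credentials) are left as comments so the shape of each handler is the focus.

```python
def make_app_engine_task(relative_uri):
    """Task payload (as a plain dict) for Cloud Tasks: an App Engine
    task that POSTs back to our own service at relative_uri."""
    return {
        "app_engine_http_request": {
            "http_method": "POST",
            "relative_uri": relative_uri,
        }
    }

def handle_cron():
    """Handler for the endpoint cron hits (short request deadline):
    just enqueue a task and return immediately."""
    task = make_app_engine_task("/tasks/launch-dataflow")  # path is illustrative
    # client = google.cloud.tasks_v2.CloudTasksClient()
    # client.create_task(parent=queue_path, task=task)
    return task

def handle_task():
    """Handler for the endpoint Cloud Tasks hits (10-minute timeout):
    safe to block here while the pipeline submits."""
    # with apache_beam.Pipeline(options=pipeline_options) as p:
    #     ...define the pipeline...   # exiting the block calls run()
    return "launched"
```

The point of the split is that `handle_cron` finishes well inside cron's deadline, while the slow pipeline submission happens under the Cloud Task's more generous timeout.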
Upvotes: 1