gana

Reputation: 175

Multiple google-dataflow and dataproc jobs

I have multiple Google Dataflow jobs for data collection and ETL purposes, followed by a Google Dataproc (Spark) job for further machine learning.

I would like to tie these jobs together into a workflow so that I can schedule the whole workflow.

Do you have any suggestions or products that could help me?

Upvotes: 2

Views: 1133

Answers (2)

gana

Reputation: 175

We have implemented two approaches for this:

  1. A custom solution for invoking Dataproc jobs: a Spring scheduler invokes Dataproc and Dataflow through the Google SDK APIs (a sketch follows this list).

  2. One Dataproc job running in streaming mode that manages the other Dataproc and Dataflow jobs. We send a message to Pub/Sub; the streaming job receives it and then invokes the rest of the chain (a second sketch appears at the end of this answer).
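
A minimal sketch of the first approach, assuming a Spring `@Scheduled` method that submits a Spark job through the `google-cloud-dataproc` Java client. The project, region, cluster, main class, and jar path are all placeholders:

```java
import com.google.cloud.dataproc.v1.Job;
import com.google.cloud.dataproc.v1.JobControllerClient;
import com.google.cloud.dataproc.v1.JobControllerSettings;
import com.google.cloud.dataproc.v1.JobPlacement;
import com.google.cloud.dataproc.v1.SparkJob;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Requires @EnableScheduling on a Spring configuration class.
@Component
public class DataprocScheduler {

    // Placeholders -- substitute your own project, region, and cluster.
    private static final String PROJECT = "my-project";
    private static final String REGION = "us-central1";
    private static final String CLUSTER = "ml-cluster";

    // Runs every day at 02:00; adjust the cron expression to your schedule.
    @Scheduled(cron = "0 0 2 * * *")
    public void submitSparkJob() throws Exception {
        JobControllerSettings settings = JobControllerSettings.newBuilder()
            .setEndpoint(REGION + "-dataproc.googleapis.com:443")
            .build();
        try (JobControllerClient client = JobControllerClient.create(settings)) {
            Job job = Job.newBuilder()
                .setPlacement(JobPlacement.newBuilder().setClusterName(CLUSTER))
                .setSparkJob(SparkJob.newBuilder()
                    .setMainClass("com.example.MlPipeline")            // placeholder class
                    .addJarFileUris("gs://my-bucket/ml-pipeline.jar")) // placeholder jar
                .build();
            client.submitJob(PROJECT, REGION, job);
        }
    }
}
```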

I prefer the 2nd solution over the 1st, because with the 1st we have to manage the Spring application ourselves (via CloudFormation, etc.).

The 2nd solution comes with the extra cost of running a Dataproc job 24/7.
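
A minimal sketch of the receiving side of the second approach, written as a plain Java Pub/Sub subscriber for brevity (our real orchestrator is a Dataproc streaming job; this just shows the message-driven dispatch). The subscription name and the dispatch logic are hypothetical:

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class ChainOrchestrator {

    public static void main(String[] args) {
        // Placeholder subscription; upstream jobs publish a message here when done.
        ProjectSubscriptionName subscription =
            ProjectSubscriptionName.of("my-project", "job-finished");

        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            String finishedStage = message.getAttributesOrDefault("stage", "unknown");
            // Decide which job comes next in the chain and launch it via the
            // Dataproc/Dataflow APIs (see the submission sketch above).
            launchNextJob(finishedStage);
            consumer.ack();
        };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
        subscriber.startAsync().awaitRunning();
        // Block forever; the orchestrator runs 24/7, which is the extra cost noted above.
        subscriber.awaitTerminated();
    }

    private static void launchNextJob(String finishedStage) {
        // Hypothetical dispatch: map the finished stage to the next Dataproc or
        // Dataflow job and submit it through the corresponding client library.
    }
}
```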

Upvotes: 0

Frances

Reputation: 4041

I don't know of any great answers on GCP right now, but here are a couple of options:

  • Use Google App Engine task queues.
  • Use the following pattern to trigger a Dataproc job after your Dataflow job completes (a sketch follows this list):
      1. Use Create to make a dummy PCollection with a single element.
      2. Write a ParDo over that collection where the body of the DoFn contains the Java code that calls your Dataproc job. Because it processes a collection containing one element, it will execute once (modulo retries).
      3. Take the final output of your Dataflow job and process it with a ParDo that outputs nothing, giving you an empty PCollection.
      4. Pass that PCollection in as a side input to the ParDo that calls Dataproc.
    In other words, use a fake data dependency to force ordering between the body of your Dataflow job and a final step that creates the Dataproc job.
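
A minimal sketch of that fake-dependency pattern using the Apache Beam Java SDK; `runMainPipeline` and `submitDataprocJob` are hypothetical placeholders for your real pipeline body and your Dataproc submission code. (Newer Beam releases also provide a built-in Wait.on transform for the same purpose.)

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class TriggerDataprocAfterDataflow {

    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Hypothetical: the real body of your Dataflow job, producing its final output.
        PCollection<String> finalOutput = runMainPipeline(p);

        // Step 3: a ParDo over the final output that emits nothing, yielding an
        // empty PCollection that is "ready" only once the main job has finished.
        PCollectionView<Iterable<Void>> doneSignal = finalOutput
            .apply("DropElements", ParDo.of(new DoFn<String, Void>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    // Output nothing.
                }
            }))
            .apply("AsSignal", View.asIterable());

        // Steps 1, 2, and 4: a single-element dummy collection whose DoFn calls
        // Dataproc; the side input forces it to run after the main job completes.
        p.apply("DummyTrigger", Create.of("go"))
            .apply("CallDataproc", ParDo.of(new DoFn<String, Void>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    c.sideInput(doneSignal); // fake data dependency
                    submitDataprocJob();     // hypothetical Dataproc API call
                }
            }).withSideInputs(doneSignal));

        p.run().waitUntilFinish();
    }

    private static PCollection<String> runMainPipeline(Pipeline p) {
        // Placeholder for the actual data-collection / ETL pipeline.
        return p.apply(Create.of("your", "real", "data"));
    }

    private static void submitDataprocJob() {
        // Placeholder: submit the Spark job via the Dataproc client library.
    }
}
```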

Upvotes: 1
