Roger Sanders

Reputation: 11

Google Cloud Platform Dataflow integration

Is it possible to leverage command line tools within GCP Dataflow?

Essentially, I have files that I currently pass as arguments to a command-line tool, and that tool outputs a different file based on the input. I'm not exactly sure what the tool does internally, so recreating its logic within Dataflow is out of the question. Is there any way to call this tool using the os or subprocess modules while still taking advantage of Dataflow's benefits?

Upvotes: 1

Views: 131

Answers (1)

Eric Schmidt

Reputation: 1317

Yes, you can call out to sub-processes inside of your graph. However, there are some implications to doing this. For example, inside your DoFn() you might shell out to a legacy exe to produce a flat file. At that point you have to block on that call manually, or build some kind of orchestration to process the output; there is no callback or dispatching mechanism in Apache Beam. The main side effect of this scenario is that you are now blocking the DoFn from doing any more work, burning cycles while you wait. If the sub-process calls are light, that is probably not an issue. If they are resource intensive, e.g. sequencing a genome, you are going to hit some issues.
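As a minimal sketch of that pattern in the Beam Python SDK, something like the following blocks inside process() until the tool exits. The legacy-tool executable, its flags, and the file paths are all placeholders; in a real pipeline the binary would also have to be installed on the Dataflow workers (e.g. via a custom container or setup.py), and GCS inputs staged to local disk first:

    import subprocess
    import apache_beam as beam

    class RunLegacyToolFn(beam.DoFn):
        """Runs a hypothetical legacy command-line tool once per input file."""

        def process(self, input_path):
            output_path = input_path + '.out'
            # subprocess.run blocks this DoFn until the tool finishes;
            # check=True raises on a non-zero exit code so the bundle
            # fails visibly and Dataflow can retry it.
            subprocess.run(
                ['legacy-tool', '--input', input_path, '--output', output_path],
                check=True,
            )
            yield output_path

    with beam.Pipeline() as pipeline:
        (pipeline
         | 'ListFiles' >> beam.Create(['/tmp/file1', '/tmp/file2'])
         | 'RunTool' >> beam.ParDo(RunLegacyToolFn()))

While the call is in flight, that worker thread does nothing else, which is exactly the blocking cost described above.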

A more flexible and effective way to do this type of work is to mix Cloud Composer with Cloud Dataflow: use Dataflow for the work that needs aggregation, then dispatch the long-running (sub-process) work to Cloud Composer. For example: analyze a population of 1B people, find the top Y people with feature X, then dispatch the long-running sub-process analysis on those Y.
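A rough sketch of that split as a Composer (Airflow) DAG, using plain BashOperator tasks. The DAG id, project, bucket names, pipeline path, and the legacy-tool command are all placeholders, not real resources:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id='aggregate_then_subprocess',
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        # Stage 1: launch the Dataflow aggregation that narrows
        # the population down to the top-Y subset.
        run_dataflow = BashOperator(
            task_id='run_dataflow_aggregation',
            bash_command=(
                'python /home/airflow/gcs/dags/pipeline.py '
                '--runner DataflowRunner --project my-project '
                '--temp_location gs://my-bucket/tmp'
            ),
        )

        # Stage 2: hand the long-running sub-process work to Composer,
        # where blocking does not tie up Dataflow workers.
        run_legacy_tool = BashOperator(
            task_id='run_legacy_tool',
            bash_command='legacy-tool --input gs://my-bucket/top_y/*',
        )

        run_dataflow >> run_legacy_tool

This keeps the Dataflow graph purely for aggregation, while Composer owns the orchestration and retries for the heavy sub-process step.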

Does this help?

Upvotes: 1
