Reputation: 937
I have a Dataflow job with a fan-out of steps, each of which writes its results to a different folder on GCS. During a batch job execution, hundreds of files are written per folder.
I'd like to identify when the FileIO step is completed in order to run Java code that loads the entire content of the folder into a BigQuery table.
I know I can do it per written file with Cloud Functions and a Pub/Sub notification, but I'd prefer to do it only once, at the completion of the entire folder.
Thanks!
Upvotes: 0
Views: 1182
Reputation: 396
@Daniel Oliveira suggested an approach that you can follow, but in my opinion it is not the best way.
Two reasons why I beg to differ with him:
- Narrow scope for handling job failures: consider a situation where your Dataflow job succeeds but your load into BigQuery fails. Because of this tight coupling, you won't be able to re-run just the second job.
- Performance of the second job will become a bottleneck: in a production scenario, as your file sizes grow, the load job will become a bottleneck for other dependent processes.
Since, as you already mentioned, you cannot write directly to BigQuery in the same job, I suggest the following approach:
- Create another Beam job that loads all the files into BigQuery. You can refer to this for reading multiple files in Beam; a rough sketch of such a job follows this list.
- Orchestrate both jobs with Cloud Composer using the Dataflow Java Operator or the Dataflow Template Operator. Set the Airflow trigger rule to 'all_success' and make the load job depend on the Dataflow job (e.g. job2.set_upstream(job1)). Please refer to the Airflow documentation here.
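For what it's worth, here is a minimal sketch (in Java, since your original job is a Java pipeline) of what that second load job could look like. The file pattern, table name and the simple "id,name" line format are assumptions you would replace with your own:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class LoadFolderToBigQuery {

  // Assumption: each line is a simple "id,name" record; adapt this to your real file format.
  static class ParseLineFn extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<TableRow> out) {
      String[] parts = line.split(",", 2);
      out.output(new TableRow().set("id", parts[0]).set("name", parts.length > 1 ? parts[1] : ""));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadAllFiles", TextIO.read().from("gs://my-bucket/output/folder-a/*"))  // placeholder pattern
     .apply("ParseLines", ParDo.of(new ParseLineFn()))
     .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
         .to("my-project:my_dataset.my_table")  // placeholder table
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)  // table assumed to exist
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}

You would typically pass the input pattern and table name as pipeline options so Composer can supply them per folder.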
I hope this helps.
Upvotes: 1
Reputation: 1421
There are two ways you could do this:
Run your pipeline and, on your pipeline result, call waitUntilFinish (wait_until_finish in Python) to delay execution until your pipeline is complete, as follows:
pipeline.run().waitUntilFinish();
You can verify whether the pipeline completed successfully based on the result of waitUntilFinish, and from there you can load the contents of the folders to BigQuery. The only caveat to this approach is that your code isn't part of the Dataflow pipeline, so if you rely on the elements of your pipeline for that step it will be tougher.
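For example, a sketch of that approach, assuming CSV files under a hypothetical gs://my-bucket/output/folder-a/ prefix and an existing target table (the google-cloud-bigquery client does the load here, not Beam):

import org.apache.beam.sdk.PipelineResult;
import com.google.cloud.bigquery.*;  // google-cloud-bigquery client library, assumed to be on the classpath

PipelineResult result = pipeline.run();
PipelineResult.State state = result.waitUntilFinish();

if (state == PipelineResult.State.DONE) {
  // Load everything the pipeline wrote to the folder with a single BigQuery load job.
  BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
  TableId tableId = TableId.of("my_dataset", "my_table");  // placeholder dataset/table
  LoadJobConfiguration loadConfig =
      LoadJobConfiguration.newBuilder(tableId, "gs://my-bucket/output/folder-a/*")  // placeholder URI
          .setFormatOptions(FormatOptions.csv())  // or json()/avro(), depending on your files
          .build();
  Job loadJob = bigquery.create(JobInfo.of(loadConfig));
  loadJob = loadJob.waitFor();  // blocks until the load job finishes
  // inspect loadJob.getStatus().getError() to see whether the load succeeded
}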
The second option keeps everything inside the pipeline: the result of the FileIO.Write transform is a WriteFilesResult, which lets you get a PCollection containing the filenames of all written files by calling getPerDestinationOutputFilenames. From there you can continue your pipeline with transforms that write all those files to BigQuery. Here's an example in Java:
// PCollection of the written filenames, keyed by destination
WriteFilesResult<DestinationT> result = files.apply(FileIO.write()...);
result.getPerDestinationOutputFilenames().apply(...);
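A possible continuation of that example, where the written filenames are read back and streamed into BigQuery (ParseLineToTableRowFn and the table name are placeholders, not something from the original transform):

result.getPerDestinationOutputFilenames()          // PCollection<KV<DestinationT, String>>
    .apply(Values.create())                        // keep just the file names
    .apply(FileIO.matchAll())                      // match each written file
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles())                     // read the lines of every file
    .apply(ParDo.of(new ParseLineToTableRowFn()))  // your own DoFn: String -> TableRow
    .apply(BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")      // placeholder table
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));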
The equivalent in Python seems to be called FileResult, but I can't find good documentation for that one.
Upvotes: 1