cahen

Reputation: 16676

Pipeline initialization step in Google Dataflow

I need to clear a table before the pipeline reads its input data, and I'd like this step to run as part of the pipeline itself, in the cloud, rather than locally.

This is what the code looks like at the moment and clearTable() runs locally:

    exactTargetIntegration.clearTable(); // runs locally
    Pipeline p = Pipeline.create(options);
    PCollection<String> readFromFile =
        p.apply(TextIO.Read.from(INPUT_FILES)); // runs in the cloud
    ...

Is it possible?

Upvotes: 1

Views: 447

Answers (1)

danielm

Reputation: 3010

There is currently no way to guarantee that some action takes place before a read within the same pipeline. If you need your operation to run in the cloud, you can run it as a separate pipeline that executes before the main one.

For example:

DataflowPipelineOptions options = ...;
// The blocking runner waits for each job to finish before run() returns,
// so the deletion pipeline is guaranteed to complete before the main one starts.
options.setRunner(BlockingDataflowPipelineRunner.class);
Pipeline deletePipeline = <build deletion pipeline with options>
Pipeline mainPipeline = <build main pipeline with options>
deletePipeline.run();
mainPipeline.run();
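One way to build the deletion pipeline is to wrap the table-clearing call in a `DoFn` applied to a single dummy element, so the call executes on a Dataflow worker rather than on the machine submitting the job. This is only a sketch against the Dataflow SDK 1.x API; the `ExactTargetIntegration` class and its `clearTable()` method are taken from the question and assumed to be safe to instantiate and call from a worker:

```java
// Sketch, assuming Dataflow SDK 1.x (com.google.cloud.dataflow.sdk.*).
Pipeline deletePipeline = Pipeline.create(options);
deletePipeline
    // A single dummy element, just to have something to process
    .apply(Create.of("clear"))
    .apply(ParDo.of(new DoFn<String, Void>() {
      @Override
      public void processElement(ProcessContext c) {
        // Runs on a Dataflow worker, not locally.
        // Assumes ExactTargetIntegration is constructible here;
        // anything referenced by a DoFn must be serializable.
        new ExactTargetIntegration().clearTable();
      }
    }));
```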

Additionally, this use case is definitely something that we'd like to support; you can track the issue here: https://issues.apache.org/jira/browse/BEAM-65

Upvotes: 2
