lknguyen

Reputation: 23

Set priority for pipelines in Google Dataflow

I'm new to Google Dataflow. I have 2 Dataflow pipelines that run 2 different jobs. One is an ETL process that loads data into BigQuery, and the other reads from BigQuery and aggregates the data for reports. I want to run the ETL pipeline first and, once it completes, run the reports pipeline, so that the data in BigQuery is up to date.

I tried to run both in one pipeline, but that didn't work. For now I run the ETL pipeline manually first and then run the reports pipeline.

Can anybody give me some advice on running the 2 jobs in one pipeline? Thanks.

Upvotes: 0

Views: 230

Answers (1)

Ben Chambers

Reputation: 6130

You should be able to do both of these in a single pipeline. Rather than writing to BigQuery and then trying to read that back in and generate the report, consider just using the intermediate data for both purposes. For example:

PCollection<Input> input = /* ... */;
// Perform your transformation logic
PCollection<Intermediate> intermediate = input
  .apply(...)
  .apply(...);
// Convert the transformed results into table rows and
// write those to BigQuery.
intermediate
  .apply(ParDo.of(new IntermediateToTableRowETL()))
  .apply(BigQueryIO.write(...));
// Generate your report over the transformed data
intermediate
  .apply(...)
  .apply(...);
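
To make the branching shape concrete without needing the Dataflow SDK, here is a minimal plain-Java sketch of the same idea: the transformed data is computed once and then fed to two independent consumers. Everything in it is a hypothetical stand-in (`Sale`, `transform`, `toTableRows`, `totalsByRegion`, the sample input), and ordinary lists play the role of PCollections, so treat it as an illustration of the data flow rather than Dataflow API code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BranchingSketch {
    // Hypothetical intermediate record produced by the ETL step.
    static final class Sale {
        final String region;
        final int amount;

        Sale(String region, int amount) {
            this.region = region;
            this.amount = amount;
        }
    }

    // Stand-in for the transformation logic: parse "region,amount" lines.
    static List<Sale> transform(List<String> rawLines) {
        List<Sale> out = new ArrayList<>();
        for (String line : rawLines) {
            String[] parts = line.split(",");
            out.add(new Sale(parts[0].trim(), Integer.parseInt(parts[1].trim())));
        }
        return out;
    }

    // Branch 1: convert each record to a row for loading.
    // In the real pipeline this is where the BigQuery write would go.
    static List<String> toTableRows(List<Sale> sales) {
        List<String> rows = new ArrayList<>();
        for (Sale s : sales) {
            rows.add(s.region + "=" + s.amount);
        }
        return rows;
    }

    // Branch 2: aggregate the same records for the report.
    static Map<String, Integer> totalsByRegion(List<Sale> sales) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Sale s : sales) {
            totals.merge(s.region, s.amount, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample input, for illustration only.
        List<Sale> intermediate = transform(Arrays.asList("eu, 10", "us, 5", "eu, 7"));

        // Both branches read the same intermediate data; neither waits on the other.
        System.out.println(toTableRows(intermediate));
        System.out.println(totalsByRegion(intermediate));
    }
}
```

Because the report branch reads the intermediate records directly, it never depends on the load having finished, which is what removes the ordering problem between the two jobs.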

Upvotes: 2
