Tadayasu Yotsu

Reputation: 159

How to test Dataflow Pipeline with BigQuery

I'd like to test my pipeline. My pipeline extracts data from BigQuery, then stores the data to GCS and S3. Although there is some information about pipeline testing here, https://cloud.google.com/dataflow/pipelines/testing-your-pipeline, it does not cover the data model used when extracting data from BigQuery.

I found the following example for it, but it lacks comments, so it is a little difficult to understand. https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/test/java/com/google/cloud/dataflow/examples/cookbook/BigQueryTornadoesTest.java

Are there any good documents for testing my pipeline?

Upvotes: 1

Views: 2093

Answers (2)

Gabriel Hodoroaga

Reputation: 353

In Google Cloud it is easy to create end-to-end tests using real resources such as Pub/Sub topics and BigQuery tables.

By using the JUnit 5 extension model, https://junit.org/junit5/docs/current/user-guide/#extensions, you can hide the complexity of creating and deleting the required resources, as in the sketch below.
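For example, here is a minimal sketch of such an extension using the google-cloud-bigquery client; the dataset name "test_dataset" and the schema are placeholders, not anything prescribed by the demo:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import org.junit.jupiter.api.extension.AfterAllCallback;
import org.junit.jupiter.api.extension.BeforeAllCallback;
import org.junit.jupiter.api.extension.ExtensionContext;

// Creates a throwaway BigQuery table before the test class runs
// and drops it afterwards. Register on a test class with
// @ExtendWith(BigQueryTableExtension.class).
public class BigQueryTableExtension implements BeforeAllCallback, AfterAllCallback {

  private final BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
  // "test_dataset" is a placeholder; the timestamp keeps parallel runs from colliding.
  private final TableId tableId =
      TableId.of("test_dataset", "e2e_" + System.currentTimeMillis());

  @Override
  public void beforeAll(ExtensionContext context) {
    Schema schema = Schema.of(Field.of("name", StandardSQLTypeName.STRING));
    bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(schema)));
  }

  @Override
  public void afterAll(ExtensionContext context) {
    bigquery.delete(tableId);
  }
}
```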

You can find a demo/seed here https://github.com/gabihodoroaga/dataflow-e2e-demo and a blog post here https://hodo.dev/posts/post-31-gcp-dataflow-e2e-tests/

Upvotes: 0

Alex Amato

Reputation: 1725

In order to properly integration-test your entire pipeline, please create a small amount of sample data stored in BigQuery. Also, please create a sample bucket/folder in S3 and GCS to store your output. Then run your pipeline as you normally would, using PipelineOptions to specify the test BQ table. You can use the DirectPipelineRunner if you want to run locally. It will probably be easiest to create a script which first runs the pipeline, then downloads the data from S3 and GCS and verifies you see what you expect.
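A minimal sketch of what such a run could look like, using the old DataflowJavaSDK classes the question links to; the table spec, bucket path, and the pass-through DoFn are all placeholders for your real pipeline:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class PipelineIntegrationTest {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    // Run on the local machine instead of the Dataflow service.
    options.setRunner(DirectPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    p.apply(BigQueryIO.Read.from("my-project:test_dataset.sample_table")) // test BQ table
     .apply(ParDo.of(new DoFn<TableRow, String>() {
       @Override
       public void processElement(ProcessContext c) {
         // Stand-in for the real transforms under test.
         c.output(c.element().toString());
       }
     }))
     .apply(TextIO.Write.to("gs://my-test-bucket/output")); // test GCS location
    p.run();
  }
}
```

A script can then fetch gs://my-test-bucket/output* (and the S3 equivalent) and diff it against the expected output.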

If you just want to test your pipeline's transforms on some offline data, then please follow the WordCount example.
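The linked BigQueryTornadoesTest follows the same pattern as the WordCount tests. Here is a commented sketch of that pattern; ExtractMonthFn is a made-up stand-in for whatever DoFn you want to test:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.DoFnTester;
import org.hamcrest.CoreMatchers;
import org.junit.Assert;
import org.junit.Test;

public class ExtractMonthFnTest {

  // A trivial DoFn standing in for the transform under test:
  // it pulls the "month" field out of each BigQuery row.
  static class ExtractMonthFn extends DoFn<TableRow, String> {
    @Override
    public void processElement(ProcessContext c) {
      c.output(String.valueOf(c.element().get("month")));
    }
  }

  @Test
  public void testExtractMonthFn() throws Exception {
    // Build a TableRow by hand instead of reading from BigQuery.
    TableRow row = new TableRow().set("month", 6);
    // DoFnTester runs a single DoFn over in-memory inputs.
    DoFnTester<TableRow, String> fnTester = DoFnTester.of(new ExtractMonthFn());
    Assert.assertThat(fnTester.processBatch(row), CoreMatchers.hasItems("6"));
  }
}
```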

Upvotes: 1
