Reputation: 151
I recently started a Dataflow job to load data from GCS and run it through DLP's identification template and write the masked data to BigQuery. I could not find a Google-provided template for batch processing hence used the streaming one (ref: link). I see only 50% of the rows are written to the destination BigQuery table. There is no activity on the pipeline for a day even though it is in the running state.
Upvotes: 0
Views: 741
Reputation: 21
yes DLP Dataflow template is a streaming pipeline but with some easy changes you can also use it as batch. Here is the template source code. As you can see it uses File IO transform and poll/watch for any new file in every 30 seconds. if you take out the window transform and continuous polling syntax, you should be able to execute as batch.
In terms of pipeline not progressing all data, can you confirm if you are running a large file with default settings? e.g- workerMachineType, numWorkers, maxNumWorkers? Current pipeline code uses a line based offsetting which requires a highmem machine type with large number of workers if the input file is large. e.g for 10 GB, 80M lines you may need 5 highmem workers.
One thing you can try to see if it helps is to trigger the pipeline with more resources e.g: --workerMachineType=n1-highmem-8, numWorkers=10, maxNumWorkers=10 and see if it's any better.
Alternatively, there is a V2 solution that uses byte based offsetting using state and timer API for optimized batching and resource utilization that you can try out.
Upvotes: 2