Reputation: 31
I'm having some trouble with a Dataflow pipeline that reads from PubSub and writes to BigQuery.
I had to drain it to perform some more complex updates. When I reran the pipeline, it started reading from PubSub at a normal rate, but after a few minutes it stopped, and now it is not reading messages from PubSub at all! The data watermark is almost one week behind and not progressing. According to Stackdriver, there are more than 300k unread messages in the subscription.
It was running normally before the update, and now even if I downgrade my pipeline to the previous version (the one running before the update), I still can't get it to work.
I tried several configurations:
1) We use Dataflow autoscaling. I tried starting the pipeline with more powerful workers (n1-standard-64) and limiting it to ten workers, but performance did not improve and the job did not autoscale (it kept only the initial worker).
2) I tried providing more disk through diskSizeGb (2048) and diskType (pd-ssd), but still no improvement.
3) I checked the PubSub quotas and pull/push rates, and they are all completely normal.
The pipeline shows no errors or warnings; it just won't progress.
I checked the instances' resources: CPU, RAM, and disk read/write rates are all okay compared to other pipelines. The only thing that is a little higher is the network rate: about 400k bytes/sec (2000 packets/sec) outgoing and 300k bytes/sec (1800 packets/sec) incoming.
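For reference, the worker settings from attempts 1) and 2) could be written out roughly as below. This is only a sketch: the names above (diskSizeGb, diskType) are as given in the question, while the flags here are the Apache Beam Python SDK equivalents, which is an assumption about the SDK in use. The helper `dataflow_worker_flags` is hypothetical and only assembles the flag list; it does not launch a job.

```python
def dataflow_worker_flags(machine_type, max_workers, disk_size_gb, disk_type=None):
    """Build Beam Python SDK flags for a streaming Dataflow worker pool.

    Hypothetical helper for illustration; it only returns a list of strings.
    """
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")

    flags = [
        "--runner=DataflowRunner",
        "--streaming",
        # Autoscaling is requested explicitly so the cap below has an effect.
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        f"--worker_machine_type={machine_type}",
        f"--max_num_workers={max_workers}",
        f"--disk_size_gb={disk_size_gb}",
    ]
    # The disk type is optional; the exact value format expected by the
    # service is not shown in the question, so it is passed through as given.
    if disk_type is not None:
        flags.append(f"--worker_disk_type={disk_type}")
    return flags


# Values taken from the question: n1-standard-64, ten workers, 2048 GB disk.
flags = dataflow_worker_flags("n1-standard-64", 10, 2048)
print(" ".join(flags))
```

In a real pipeline, a list like this would typically be passed to `PipelineOptions(flags)` from `apache_beam.options.pipeline_options`, together with the project, region, and temp location, which are omitted here.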
What would you suggest I do?
Upvotes: 0
Views: 1357
Reputation: 176
The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam. Make sure you follow the documentation as a reference when you update. Quotas can cause a slow-running pipeline and a lack of output, but you mentioned those are fine.
It seems we need to look at the job itself. I recommend opening an issue on the PIT here and we’ll take a look. Make sure to provide your project ID, job ID, and all the necessary details.
Upvotes: 0