Reputation: 604
So I had a job running for downloading some files and it usually takes about 10 minutes. this one ran for more than an hour before it finally failed with the following, only error message:
Workflow failed. Causes: (3f03d0279dd2eb98): The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
So here I am :-) The jobId: 2017-08-29_13_30_03-3908175820634599728
Just out of curiosity, will we be billed for the hour of stuckness? And what was the problem?
I'm working with Dataflow-Version 1.9.0
Thanks Google Dataflow Team
Upvotes: 1
Views: 294
Reputation: 1731
It seems as though the job had all its workers spending all the time doing Java garbage collection (almost 100%, about 7 second Full GCs occurring every ~7 seconds).
Your next best steps are to get a heap dump of the job by logging into one of the machines and using jmap. Use a heap dump analysis tool to inspect where all the memory is allocated to. It is best to compare the heap dump of a properly functioning job against the heap dump of a broken job. If you would like further help from Google, feel free to contact Google Cloud Support and share this SO question and the heap dumps. This would be especially useful if you suspect the issue is somewhere within Google Cloud Dataflow.
Upvotes: 1