Reputation: 159
What are the most common reasons for a Dataflow job to fail with the following message:
The work item was attempted on these workers: wn-vlg-to-1vro-1606304136-11250335-91et-harness-n08k Root cause: The worker lost contact with the service.
I've also observed multiple "Refusing to split..." messages in the worker logs:
Refusing to split GroupedShuffleReader <dataflow_worker.shuffle.GroupedShuffleReader object at 0x7f6a231c4a90> at BZt9HwAB
job_id="2020-11-25_04_58_21-4901626503823103758"
Upvotes: 0
Views: 488
Reputation: 77
A common reason for a Dataflow job to fail with “Root cause: The worker lost contact with the service” is that the workers are running out of memory.
You can confirm memory issues in the Stackdriver Logs using an advanced filter like [1] (see [2] for the query syntax).
Possible solutions are to use worker machine types with more memory, or to decrease the parallelism of processing with the pipeline option --numberOfWorkerHarnessThreads (or --number_of_worker_harness_threads for Python); see the sketch after the references below.
[1]
resource.type="dataflow_step"
resource.labels.job_id="YOUR_JOB_ID"
severity>=WARNING
("thrashing=true" OR "OutOfMemoryError" OR "Out of memory" OR "Shutting down JVM")
[2] https://cloud.google.com/logging/docs/view/advanced-queries#getting-started
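As a rough illustration of the second option (this is a minimal sketch, not your pipeline; the project, region, and bucket values are placeholders), the Python pipeline options mentioned above can be passed when launching the job like this:

# Minimal sketch: launch a Python Beam pipeline on Dataflow with a
# higher-memory machine type and fewer worker harness threads.
# Project, region, and bucket values below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                  # placeholder
    "--region=us-central1",                  # placeholder
    "--temp_location=gs://my-bucket/tmp",    # placeholder
    "--machine_type=n1-highmem-4",           # higher-memory worker machine type
    "--number_of_worker_harness_threads=4",  # reduce per-worker parallelism
])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create([1, 2, 3])
     | "Double" >> beam.Map(lambda x: x * 2))

Lowering the number of harness threads reduces how many bundles a worker processes concurrently, which trades throughput for a smaller peak memory footprint per worker.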
Upvotes: 1