Trevor Lazarus

Reputation: 159

Google Cloud Dataflow Job - Refusing to split

What are the most common reasons for a Dataflow job to fail with the following message:

The work item was attempted on these workers: wn-vlg-to-1vro-1606304136-11250335-91et-harness-n08k Root cause: The worker lost contact with the service.

I've also observed multiple "Refusing to split..." prints in the worker logs:

Refusing to split GroupedShuffleReader <dataflow_worker.shuffle.GroupedShuffleReader object at 0x7f6a231c4a90> at BZt9HwAB

job_id="2020-11-25_04_58_21-4901626503823103758"

Upvotes: 0

Views: 488

Answers (1)

Sarrión

Reputation: 77

A common reason for a Dataflow job to fail with “Root cause: The worker lost contact with the service” is the workers running out of memory.

You can identify memory issues in the Stackdriver (Cloud Logging) logs using an advanced filter like [1] (see also [2]).

Possible solutions are to use worker machine types with more memory, or to decrease the parallelism of processing with the pipeline option --numberOfWorkerHarnessThreads (--number_of_worker_harness_threads for Python).
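As a sketch of how both options could be applied when launching a Python pipeline (the script name, project, and region are placeholders, and the specific machine type and thread count are just example values you would tune for your workload):

```shell
# Launch with a high-memory worker machine type and fewer harness threads
# per worker, which reduces per-worker parallelism and memory pressure.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=your-project \
  --region=us-central1 \
  --machine_type=n1-highmem-4 \
  --number_of_worker_harness_threads=2
```

For a Java pipeline the equivalent flags would be --workerMachineType and --numberOfWorkerHarnessThreads.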

[1]

resource.type="dataflow_step"
resource.labels.job_id="YOUR_JOB_ID"
severity>=WARNING
("thrashing=true" OR "OutOfMemoryError" OR "Out of memory" OR "Shutting down JVM")

[2] https://cloud.google.com/logging/docs/view/advanced-queries#getting-started
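If you prefer the command line to the Logs Viewer, the filter in [1] can also be run with the gcloud CLI (the job ID and project are placeholders; --limit just caps the number of returned entries):

```shell
# Query Cloud Logging for OOM-related Dataflow worker messages
gcloud logging read '
resource.type="dataflow_step"
resource.labels.job_id="YOUR_JOB_ID"
severity>=WARNING
("thrashing=true" OR "OutOfMemoryError" OR "Out of memory" OR "Shutting down JVM")
' --project=your-project --limit=50
```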

Upvotes: 1

Related Questions